Troubleshooting Cases

Debugging process for some frequently encountered JuiceFS problems.

Failed to mount JuiceFS / mount hangs

If the JuiceFS mount fails, look for clues in the output messages: the JuiceFS Client surfaces most errors directly. For example, an incorrect AK/SK will cause a failure when creating the object storage bucket.

For unexpected situations, we recommend first upgrading the JuiceFS Client to the latest version, to rule out bugs that have already been fixed:

sudo juicefs version --upgrade

If the problem persists after the upgrade, add --foreground --verbose to the mount command to run the JuiceFS Client in the foreground with more debug logs:

sudo juicefs mount $VOL_NAME /jfs --foreground --verbose

For example, the log may indicate a DNS failure, i.e. the JuiceFS Client cannot resolve the metadata server domain name. In this case, confirm that the upstream DNS server is working and flush the local DNS cache.
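
A quick check might look like this (the metadata domain below is the example domain used later in this document; the cache-flush command assumes systemd-resolved and varies by distribution):

# check that the metadata domain resolves from this host
dig aliyun-bj-1.meta.juicefs.com

# flush the local DNS cache (systemd-resolved only; other resolvers differ)
sudo resolvectl flush-caches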

If the domain name resolves correctly but the connection still fails, the firewall has likely restricted port access, preventing the JuiceFS Client from reaching the metadata server. Make sure the host allows outbound traffic to TCP ports 9300 - 9500, which are required for communication with the JuiceFS metadata service.

If you are using a local iptables firewall, execute the following command to allow outbound traffic to TCP ports 9300 - 9500:

sudo iptables -A OUTPUT -p tcp --match multiport --dports 9300:9500 -j ACCEPT

If you are using the security groups of a public cloud service, adjust the rules accordingly in its console.

Read / Write error

There are many possible causes for I/O errors, but you can always find useful clues in the client log (defaults to /var/log/juicefs.log, see mount parameters for details). This section covers some of the more frequently seen problems and their troubleshooting steps.
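
As a starting point, scanning the client log for recent errors often narrows things down quickly; a minimal example using the default log path:

# list the most recent error entries in the client log
grep '<ERROR>' /var/log/juicefs.log | tail -n 20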

Connection problems with Metadata service

The JuiceFS Client maintains TCP connections with the Metadata cluster in order to provide fast metadata operations for the file system. If connection problems occur, logs will be filled with errors like these:

<ERROR>: request 402 (40) timeout after 1.000603709s
<ERROR>: request 428 (3283870) failed after tried 30 times

The request 402 and request 428 in the above logs are RPC command codes used by the client to communicate with the metadata service; most of them correspond to file system metadata operations. If the log contains a large number of such RPC timeout errors, use the following commands to troubleshoot:

# find Metadata address in established TCP connections
lsof -iTCP -sTCP:ESTABLISHED | grep jfsmount
# if JuiceFS is already unmounted, there'll be no TCP connections, find Metadata master address in config file
grep master ~/.juicefs/$VOL_NAME.conf

# verify access to the Metadata service port, assuming the Metadata address is aliyun-bj-1.meta.juicefs.com
telnet aliyun-bj-1.meta.juicefs.com 9402
# if telnet is not available, curl / wget can be used instead
# but the Metadata server does not speak HTTP, so it can only return an empty reply
curl aliyun-bj-1.meta.juicefs.com:9402
# curl: (52) Empty reply from server

# if metadata DNS address cannot be resolved, verify that client host can access master_ip
# obtain master_ip in config file
grep master_ip -A 3 ~/.juicefs/$VOL_NAME.conf
# verify access towards master_ip using above telnet or curl commands

# if metadata DNS address can be resolved, verify results match master_ip inside config file
dig aliyun-bj-1.meta.juicefs.com

The above troubleshooting usually reveals one of the following problems:

Unable to resolve the DNS address: in JuiceFS Cloud Service, Metadata DNS names are public, so a resolution failure on the client host usually indicates a DNS configuration issue; fix that first.

On the other hand, even if the DNS address is unavailable, the client can work properly as long as it can reach the Metadata service via IP. That's why in on-premise deployments JuiceFS doesn't strictly need a DNS address; it exists primarily for disaster recovery. If you don't plan to migrate Metadata to another host, it's safe to skip setting up DNS name resolution for Metadata, although this will produce some WARNING logs that can be ignored. If you do decide to set up DNS resolution, be sure to use the correct IPs: clients will run into fatal errors if the DNS name resolves but points to the wrong IP addresses.
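
For on-premise deployments without an internal DNS server, one simple way to provide this resolution is a hosts entry on each client host. This is a minimal sketch with hypothetical values; replace them with your actual metadata domain and the master_ip from the config file:

# map the metadata domain to master_ip (hypothetical domain and IP)
echo "192.0.2.10 meta.example.internal" | sudo tee -a /etc/hosts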

Connection problems with master_ip: in JuiceFS Cloud Service, clients usually connect to the Metadata service via its public IP. To troubleshoot connection problems with master_ip:

  • If curl or ping fails with timeouts, there is usually a problem with security group settings; check them thoroughly to rule out firewall issues
  • If probing the Metadata service port (default 9402) results in Connection Refused, the client has network access to the destination IP, but the Metadata Service isn't running (this usually happens in on-premise deployments)
  • In on-premise deployments, Metadata DNS names are managed internally; use the troubleshooting commands above to verify that the DNS resolution results match the master_ip in the config file. If they don't match, fix the DNS resolution or re-mount the JuiceFS Client according to the actual situation

Connection problems with object storage (slow internet speed)

If the JuiceFS Client cannot connect to the object storage service, or the bandwidth is simply not enough, JuiceFS will complain in the logs:

# upload speed is slow
<INFO>: slow request: PUT chunks/1986/1986377/1986377131_11_4194304 (%!s(<nil>), 20.512s)

# errors uploading to object storage may be accompanied by Go stacktraces; the function names below indicate an upload error
<ERROR>: flush 9902558 timeout after waited 8m0s [writedata.go:604]
<ERROR>: pending slice 9902558-80: {chd:0xc0007125c0 off:0 chunkid:1986377183 cleng:12128803 soff:0 slen:12128803 writer:0xc010a92240 freezed:true done:false status:0 notify:0xc010e57d70 started:{wall:13891666526241100901 ext:5140404761832 loc:0x35177c0} lastMod:{wall:13891666526250536970 ext:5140414197911 loc:0x35177c0}} [writedata.go:607]
<WARNING>: All goroutines (718):
goroutine 14275 [running]:
jfs/mount/fs.(*inodewdata).flush(0xc004ec6fc0, {0x7fbc06385918, 0xc00a08c140}, 0x0?)
/p8s/root/jfs/mount/fs/writedata.go:611 +0x545
jfs/mount/fs.(*inodewdata).Flush(0xc0007ba1e0?, {0x7fbc06385918?, 0xc00a08c140?})
/p8s/root/jfs/mount/fs/writedata.go:632 +0x25
jfs/mount/vfs.Flush({0x2487e98?, 0xc00a08c140}, 0x9719de, 0x8, 0x488c0e?)
/p8s/root/jfs/mount/vfs/vfs.go:1099 +0x2c3
jfs/mount/fuse.(*JFS).Flush(0x1901a65?, 0xc0020cc1b0?, 0xc00834e3d8)
/p8s/root/jfs/mount/fuse/fuse.go:348 +0x8e
...
goroutine 26277 [chan send, 9 minutes]:
jfs/mount/chunk.(*wChunk).asyncUpload(0xc005f5a600, {0x0?, 0x0?}, {0xc00f849f50, 0x28}, 0xc0101e55e0, {0xc010caac80, 0x4d})
/p8s/root/jfs/mount/chunk/cached_store.go:531 +0x2e5
created by jfs/mount/chunk.(*wChunk).upload.func1
/p8s/root/jfs/mount/chunk/cached_store.go:615 +0x30c
...

If the problem is a network connectivity issue, fix it so that the JuiceFS Client can reliably reach the object storage service. But if the errors are caused by low bandwidth, there is more to consider.

The first issue with a slow connection is upload / download timeouts (demonstrated in the above error logs). To tackle this problem (see the example mount command after the list below):

  • Reduce upload concurrency, e.g. --max-uploads=1, to avoid upload timeouts.
  • Reduce the buffer size, e.g. --buffer-size=64 or even lower. With ample bandwidth, increasing the buffer size improves parallel performance, but in a low speed environment it only makes flush operations slow and prone to timeouts.
  • The default timeout for GET / PUT requests is 60 seconds; increasing --get-timeout and --put-timeout may help with read / write timeouts.
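
Putting these together, a mount command tuned for a low bandwidth environment might look like the sketch below; the exact values are illustrative and should be adjusted to your actual link speed:

sudo juicefs mount $VOL_NAME /jfs \
  --max-uploads=1 \
  --buffer-size=64 \
  --get-timeout=120 \
  --put-timeout=120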

In addition, the "Client Write Cache" feature needs to be used with caution in low bandwidth environments. Let's briefly go over the JuiceFS Client background job design: every JuiceFS Client runs background jobs by default, one of which is data compaction, and if a client has a poor internet connection, it will drag down performance for the whole system. Things get worse when client write cache is also enabled: compaction results are uploaded too slowly, forcing other clients into read hangs when they access the affected files:

# while compaction results are slowly being uploaded by low speed clients, reads from other clients will hang and eventually fail
<WARNING>: readworker: unexpected data block size (requested: 4194304 / received: 0)
<ERROR>: read for inode 0:14029704 failed after tried 30 times
<ERROR>: read file 14029704: input/output error
<INFO>: slow operation: read (14029704,131072,0): input/output error (0) <74.147891>
<WARNING>: fail to read chunkid 1771585458 (off:4194304, size:4194304, clen: 37746372): get chunks/1771/1771585/1771585458_1_4194304: oss: service returned error: StatusCode=404, ErrorCode=NoSuchKey, ErrorMessage="The specified key does not exist.", RequestId=62E8FB058C0B5C3134CB80B6
<WARNING>: readworker: unexpected data block size (requested: 4194304 / received: 0)
<WARNING>: fail to read chunkid 1771585458 (off:0, size:4194304, clen: 37746372): get chunks/1771/1771585/1771585458_0_4194304: oss: service returned error: StatusCode=404, ErrorCode=NoSuchKey, ErrorMessage="The specified key does not exist.", RequestId=62E8FB05AC30323537AD735D

To avoid this type of read hang, we recommend that you disable background jobs for low bandwidth clients: go to the "Access Control" page in the JuiceFS console, create a new Access Token, and disable background jobs. Background jobs will not be scheduled when mounting with this token.

Read amplification

In JuiceFS, read amplification typically manifests as object storage traffic being much larger than the JuiceFS Client read speed. For example, the JuiceFS Client is reading at 200MiB/s while S3 traffic grows to 2GB/s.

JuiceFS is equipped with a prefetch mechanism: when a block is read at an arbitrary position, the whole block is asynchronously scheduled for download. This read optimization is enabled by default, but in some cases it causes read amplification. Knowing this, we can start diagnosing.

We'll collect the JuiceFS access log (see Access log) to determine the file system access patterns of our application, and adjust the JuiceFS configuration accordingly. Below is a diagnosis process from an actual production environment:

# Collect access log for a period of time, like 30s:
cat /jfs/.oplog | grep -v "^#$" >> op.log

# Simple analysis using wc / grep finds out that most operations are read:
wc -l op.log
grep "read (" op.log | wc -l

# Pick a file and track operation history using its inode (first argument of read):
grep "read (148153116," op.log

The access log looks like:

2022.09.22 08:55:21.013121 [uid:0,gid:0,pid:0] read (148153116,131072,28668010496,19235): OK (131072) <1.309992>
2022.09.22 08:55:21.577944 [uid:0,gid:0,pid:0] read (148153116,131072,14342746112,19235): OK (131072) <1.385073>
2022.09.22 08:55:22.098133 [uid:0,gid:0,pid:0] read (148153116,131072,35781816320,19235): OK (131072) <1.301371>
2022.09.22 08:55:22.883285 [uid:0,gid:0,pid:0] read (148153116,131072,3570397184,19235): OK (131072) <1.305064>
2022.09.22 08:55:23.362654 [uid:0,gid:0,pid:0] read (148153116,131072,100420673536,19235): OK (131072) <1.264290>
2022.09.22 08:55:24.068733 [uid:0,gid:0,pid:0] read (148153116,131072,48602152960,19235): OK (131072) <1.185206>
2022.09.22 08:55:25.351035 [uid:0,gid:0,pid:0] read (148153116,131072,60529270784,19235): OK (131072) <1.282066>
2022.09.22 08:55:26.631518 [uid:0,gid:0,pid:0] read (148153116,131072,4255297536,19235): OK (131072) <1.280236>
2022.09.22 08:55:27.724882 [uid:0,gid:0,pid:0] read (148153116,131072,715698176,19235): OK (131072) <1.093108>
2022.09.22 08:55:31.049944 [uid:0,gid:0,pid:0] read (148153116,131072,8233349120,19233): OK (131072) <1.020763>
2022.09.22 08:55:32.055613 [uid:0,gid:0,pid:0] read (148153116,131072,119523176448,19233): OK (131072) <1.005430>
2022.09.22 08:55:32.056935 [uid:0,gid:0,pid:0] read (148153116,131072,44287774720,19233): OK (131072) <0.001099>
2022.09.22 08:55:33.045164 [uid:0,gid:0,pid:0] read (148153116,131072,1323794432,19233): OK (131072) <0.988074>
2022.09.22 08:55:36.502687 [uid:0,gid:0,pid:0] read (148153116,131072,47760637952,19235): OK (131072) <1.184290>
2022.09.22 08:55:38.525879 [uid:0,gid:0,pid:0] read (148153116,131072,53434183680,19203): OK (131072) <0.096732>

Studying the access log, it's easy to conclude that our application performs frequent small random reads on a very large file. Notice how the offset (the third argument of read) jumps significantly between reads: consecutive reads access very different parts of the file, so prefetched data blocks (4MiB, i.e. 4194304 bytes, by default) are not effectively utilized and only cause read amplification. In this situation, we can safely set --prefetch to 0 so that prefetch concurrency is zero, which essentially disables prefetching. Re-mount, and the problem is solved.
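
For reference, the re-mount could look like this minimal sketch (volume name and mount point are placeholders):

sudo juicefs mount $VOL_NAME /jfs --prefetch=0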

High memory usage

If the JuiceFS Client takes up too much memory, you may optimize memory usage using the methods below. Note that memory optimization is not free: each adjustment brings a corresponding trade-off, so do sufficient testing and verification before applying it.

  • The read/write buffer size (--buffer-size) directly correlates with JuiceFS Client memory usage; using a lower --buffer-size will effectively decrease memory usage, but note that the reduction may also affect read and write performance. Read more at Read/Write Buffer.
  • The JuiceFS mount client is a Go program, which means you can decrease GOGC (default 100, a percentage) to adopt more aggressive garbage collection (see the sketch after this list). This inevitably increases CPU usage and may even directly hinder performance. Read more in Go Runtime.
  • If you use self-hosted Ceph RADOS as the data storage of JuiceFS, consider replacing glibc with TCMalloc; the latter comes with more efficient memory management and may decrease the off-heap memory footprint in this scenario.
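
As an illustration, the GOGC adjustment could be applied through the environment when mounting. This is a minimal sketch under the assumption that the variable is passed through to the mount process; the value 50 is illustrative:

# run the mount with a more aggressive Go GC target; sudo -E preserves the environment
GOGC=50 sudo -E juicefs mount $VOL_NAME /jfs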