Cache
To improve performance, JuiceFS supports multiple levels of caching to reduce latency and increase throughput, including metadata cache, data cache, and cache sharing among multiple clients (also known as a cache group).
Data caching effectively improves the performance of random reads. For applications that require high random read performance (e.g. Elasticsearch, ClickHouse), it is recommended to use a faster storage medium and allocate more cache space.
Meanwhile, caching improves performance only when an application repeatedly reads files, so if you know for sure that data is accessed only once (e.g. data cleansing during ETL), you can safely turn off caching to avoid its overhead.
Consistency
JuiceFS provides a "close-to-open" consistency guarantee: when multiple clients use the same file at the same time, changes made by client A may not be immediately visible to other clients before the file is closed, but once A closes the file, all clients will see the latest changes when they reopen it afterwards, whether or not they are on the same host as A.
"Close-to-open" is the minimum consistency guarantee provided by JuiceFS; in some cases it is not necessary to reopen the file to access the latest written data:
- When multiple applications access the same file through the same JuiceFS client, changes are immediately visible to all of them.
- When viewing the latest data on different nodes with the tail -f command (Linux only).
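As an illustration of close-to-open consistency, the following sketch writes a file on one node and reads it on another after the writer has closed it. The mount point /jfs and the host names node-a and node-b are placeholders.

```shell
# On node-a: the shell redirection closes file.txt once the command finishes
echo "hello" > /jfs/file.txt

# On node-b: because the writer has closed the file, a fresh open
# (performed implicitly by cat) is guaranteed to see the latest content
cat /jfs/file.txt
```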
In terms of object storage, JuiceFS splits files into data blocks (4MiB by default), assigns each a unique ID and saves them on object storage. Any modification to a file generates new data blocks, while the original blocks remain unchanged, including the cached data on the local disk. So there is no need to worry about the consistency of the data cache: once a file is modified, JuiceFS reads the new data blocks from object storage, and will not read the blocks (which will be deleted later) corresponding to the overwritten parts of the file.
Metadata Cache
JuiceFS caches metadata in the kernel and in client memory to improve performance.
Metadata Cache in Kernel
Three kinds of metadata can be cached in the kernel: attributes, file entries and directory entries (direntries). Their cache timeouts are configurable through the following parameters:
--attrcacheto=ATTRCACHETO
File attribute cache timeout in seconds (default: 1)
--entrycacheto=ENTRYCACHETO
File entry cache timeout in seconds (default: 1)
--direntrycacheto=DIRENTRYCACHETO
Directory entry cache timeout in seconds (default: 1)
Attributes, entries and direntries are cached for 1 second by default, which speeds up lookup and getattr operations. When clients on multiple nodes share the same file system, metadata cached in the kernel only expires by TTL, so in edge cases it is possible that metadata modifications (e.g. chown) made on node A are not immediately visible on node B. Nevertheless, all nodes will eventually see the changes made by A once the cache expires.
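For example, to trade a slightly longer staleness window for fewer metadata requests, you could raise the kernel metadata cache timeouts at mount time. This is a sketch only; the volume name myjfs and mount point /jfs are placeholders, and the values should be chosen for your workload.

```shell
# Cache attributes, file entries and directory entries in the kernel
# for 5 seconds instead of the default 1 second
juicefs mount myjfs /jfs \
  --attrcacheto=5 \
  --entrycacheto=5 \
  --direntrycacheto=5
```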
Metadata Cache in Client Memory
In cases where the client needs to list and query the metadata service frequently, the client can cache all the dentries of visited directories in memory. This is enabled by the following parameter (enabled by default):
--metacache
Cache metadata in client memory
When enabled, visited directories are cached in client memory for 5 minutes (adjustable via the --metacacheto option). The cache is invalidated upon any change to the directory to guarantee consistency. The metadata cache improves performance for operations such as lookup, getattr, access and open. The JuiceFS client caches metadata for up to 500,000 directories by default, which can be adjusted with the --max-cached-inodes option.
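As a sketch, the in-memory metadata cache could be tuned like this; the volume name myjfs and mount point /jfs are placeholders, and the timeout value is assumed to be in seconds (the default corresponds to 5 minutes).

```shell
# Keep directory metadata in client memory longer and raise the
# number of cached directories for a metadata-heavy workload
juicefs mount myjfs /jfs \
  --metacache \
  --metacacheto=600 \
  --max-cached-inodes=1000000
```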
In addition, the client caches symbolic links. Since symbolic links are immutable (modifying one actually creates a new symlink that replaces the old one), the cached content never expires.
To maintain strong data consistency, the open system call accesses metadata directly from the metadata service, bypassing the local metadata cache. But for read-only scenarios (e.g. model training), you can use --opencache to enable caching for open and further improve read performance. This feature is disabled by default.
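For instance, a read-only training node might be mounted with open caching enabled. This is a sketch with an assumed volume name and mount point; check juicefs mount --help on your version for the exact form of the option.

```shell
# Read-only workload: allow open() to be served from the local
# metadata cache instead of always querying the metadata service
juicefs mount myjfs /jfs --opencache
```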
Data Cache
To improve performance, JuiceFS also provides multiple caching mechanisms for data, including the kernel page cache, local file system cache on the client host, and the read/write buffer inside the client process itself. Read requests try the kernel page cache, the client process buffer and the local disk cache in turn; if the requested data is not found in any cache level, it is read from object storage and written back into every cache level asynchronously to speed up the next access.
Read/Write Buffer
The mount parameter --buffer-size controls the read/write buffer size of the JuiceFS client, which defaults to 300 (in MiB). The buffer size dictates both the amount of memory used for file reads (and readahead) and the amount of memory holding pending write pages. Naturally, we recommend increasing --buffer-size under high concurrency to effectively improve performance.
If you wish to improve write speed and have already increased --max-uploads for more upload concurrency, but see no noticeable increase in upload traffic, consider also increasing --buffer-size so that concurrent threads can allocate memory for data uploads more easily. This also works in the opposite direction: if increasing --buffer-size does not bring an increase in upload traffic, you should probably increase --max-uploads as well.
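As a tuning sketch for a write-heavy workload with ample memory and bandwidth (volume name myjfs, mount point /jfs and the values shown are assumptions), buffer size and upload concurrency could be raised together:

```shell
# Larger read/write buffer (in MiB) plus more concurrent uploads;
# increase both together so that neither becomes the bottleneck
juicefs mount myjfs /jfs \
  --buffer-size=1024 \
  --max-uploads=50
```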
--buffer-size also controls the data upload size for each flush operation, which means that for clients working in a low-bandwidth environment, you may need a smaller --buffer-size to avoid flush timeouts. Refer to "Connection problems with object storage" for troubleshooting under low internet speed.
You can check read/write buffer usage through the JuiceFS Prometheus API; the metric names are juicefs_totalBufferUsed and juicefs_readBufferUsed. But depending on the actual scenario, the read/write buffer may always be fully utilized regardless of your optimization effort, so you should adjust this setting according to the actual situation rather than relying solely on these metrics.
Kernel Page Cache
The kernel builds page cache for recently accessed files. When these files are opened again, data is fetched directly from the page cache to achieve the best performance.
The JuiceFS metadata service tracks a list of recently opened files. If a file is opened repeatedly by the same client, the metadata service informs the client whether the kernel page cache is still valid by checking the modification history; if the file has been modified, all relevant page cache is invalidated on the next open, ensuring that the client always reads the latest data.
Reading the same file in JuiceFS repeatedly will be extremely fast, with millisecond latency and gigabytes of throughput.
Writeback Cache in Kernel
Starting from Linux kernel 3.15, FUSE supports writeback-cache mode, in which the kernel combines high-frequency random small writes (e.g. 10-100 bytes) to significantly improve random write performance.
Enable writeback-cache mode by adding the -o writeback_cache option when running juicefs mount. Note that FUSE writeback-cache mode is different from Write Cache in Client: the former is a kernel implementation, while the latter happens inside the JuiceFS client. Read the corresponding sections to learn their intended scenarios.
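A sketch of enabling the kernel writeback cache at mount time (volume name and mount point are assumptions):

```shell
# Pass the FUSE option through -o; this enables the kernel-side
# writeback cache, not the JuiceFS client write cache (--writeback)
juicefs mount myjfs /jfs -o writeback_cache
```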
Read Cache in Client
The client performs prefetching and caching automatically to improve sequential read performance, based on the application's read pattern. Data is cached in the local file system, which can be on any local storage device like HDD, SSD or even memory.
Data downloaded from object storage, as well as small data (smaller than a single block) uploaded to object storage, is cached by the JuiceFS client, without compression or encryption. To achieve better performance on the application's first read, use juicefs warmup to build the cache in advance.
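For example, to warm up a dataset directory before the first read (the path below is a placeholder):

```shell
# Download and cache all blocks under the given directory ahead of time
juicefs warmup /jfs/dataset
```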
If the file system where the cache directory is located does not work properly, the JuiceFS client can return an error immediately and fall back to direct access to object storage. This is usually the case for a local disk, but if the file system hosting the cache directory is abnormal and read operations hang (as can happen with some kernel-mode network file systems), JuiceFS will hang together with it. This requires you to tune the underlying file system behavior of the cache directory to fail fast.
Below are some important parameters for cache configuration (see juicefs mount for the complete reference):
--prefetch
Prefetch N blocks concurrently (default: 1). Prefetching works like this: when reading a block at an arbitrary position, the whole block is asynchronously scheduled for download. Prefetching often improves random read performance, but if your scenario cannot effectively utilize the prefetched data (for example, when reading large files randomly and sparsely), it causes read amplification; consider setting it to 0 to disable it. Please read "Read amplification" for more information.
JuiceFS is equipped with a similar internal mechanism called "readahead": when doing sequential reads, the client downloads nearby blocks in advance, improving sequential read performance. The concurrency of readahead is affected by the size of the read/write buffer: the larger the buffer, the higher the concurrency.
--cache-dir
Cache directory, defaults to /var/jfsCache. It supports multiple values separated by : and the wildcard character *, which is mostly used on dedicated cache nodes that have multiple storage devices mounted. /dev/shm is also supported if you'd like to use memory for cache. Another way to use memory for cache is to pass the string memory to this parameter, which puts the cache directly in client process memory; this is simpler than /dev/shm, but obviously the cache will be lost after a process restart, so use it for tests and evaluations.
It is recommended to use an independent disk for the cache directory whenever possible; do not use the system disk, and do not share it with other applications. Sharing not only hurts the performance of both sides, but may also cause errors in other applications (such as running out of disk space). If sharing is unavoidable, you must estimate the disk capacity required by other applications and limit the cache size (see below for details), so that JuiceFS's read cache or write cache does not take up too much space.
When multiple cache directories are set, the --cache-size option represents the total size of data in all cache directories. JuiceFS balances cache writes across directories with a simple hash algorithm and cannot tailor write strategies for different cache directories, so it is recommended that the free space of the different cache directories be close to each other; otherwise the space of some cache directory may not be fully utilized. This is also true when multiple disks are used as cache devices.
For example, --cache-dir is /data1:/data2, where /data1 has 1GiB of free space and /data2 has 2GiB, --cache-size is 3GiB and --free-space-ratio is 0.1. Because the cache write strategy is to write evenly, the maximum space allocated to each cache directory is 3GiB / 2 = 1.5GiB, resulting in a maximum of 1.5GiB of cache in the /data2 directory instead of 2GiB * 0.9 = 1.8GiB.
If you are in urgent need of freeing up disk space, you can manually delete data under the cache directory, which is usually /var/jfsCache/<vol-name>/raw/.
--cache-size and --free-space-ratio
Cache size (in MiB, default: 102400) and minimum ratio of free space (default: 0.2). Both parameters limit the cache size: if either of the two criteria is met, the JuiceFS client will reduce cache usage in an LRU manner.
The actual cache size may exceed the configured value, because it is difficult to calculate the exact disk space taken by the cache. Currently, JuiceFS sums up the sizes of all cached objects, counting at least 4 KiB for each, which is often different from the result of du.
--cache-partial-only
Only cache small files and random small reads smaller than the block size; as the name suggests, full blocks will not be cached. This option is usually used when object storage throughput is higher than that of the local cache device, or when you only need to cache small files and random reads. Default value is false.
There are two main read patterns: sequential read and random read. Sequential reads usually demand higher throughput, while random reads need lower latency. When local disk throughput is lower than that of object storage, consider enabling --cache-partial-only so that sequential reads do not cache whole blocks; instead, only small reads (like Parquet / ORC footers) are cached. This allows JuiceFS to take advantage of the low latency provided by the local disk and the high throughput provided by object storage at the same time.
Another feature that impacts random read performance is compression (the same applies to encryption). When compression is enabled, a block is always fetched from object storage and cached in its entirety, no matter how small the read actually is. This hurts first-time read performance, so it is recommended to disable compression when the workload does intensive random reads. Without compression, JuiceFS can do partial reads on a block, which improves read latency and reduces bandwidth consumption.
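To tie the parameters above together, here is a sketch of a read-cache-oriented mount; the volume name, mount point and disk paths are assumptions, and the values should match your hardware.

```shell
# Two dedicated cache disks, 200GiB total cache, keep 10% free space,
# and disable prefetch for sparse random reads on large files
juicefs mount myjfs /jfs \
  --cache-dir=/data1/jfscache:/data2/jfscache \
  --cache-size=204800 \
  --free-space-ratio=0.1 \
  --prefetch=0
```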
Write Cache in Client
Enabling client write cache can improve performance when writing a large number of small files. Read this section to learn about the client write cache.
Client write cache is disabled by default: data writes are held in the read/write buffer (in memory) and uploaded to object storage once a chunk is filled up, or when forced by the application via close()/fsync() calls. To ensure data security, the client will not commit file writes to the metadata service until the data has been uploaded to object storage.
You can see how the default "upload first, then commit" write process does not perform well when writing a large number of small files. With client write cache enabled, the write process becomes "commit first, then upload asynchronously": file writes are no longer blocked by data uploads; instead they are written to the local cache directory, committed to the metadata service, and then returned immediately. The file data in the cache directory is uploaded to object storage asynchronously in the background.
Add --writeback to the mount command to enable client write cache, but this mode comes with some risks and caveats:
- Disk reliability is crucial to data integrity; if write cache data is lost before the upload completes, the file data is lost forever. Use with caution when data reliability is critical.
- Write cache data is stored in /var/jfsCache/<vol-name>/rawstaging/ by default; do not delete files under this directory, or data will be lost.
- Write cache size is controlled by --free-space-ratio. By default, if write cache is not enabled, the JuiceFS client uses up to 80% of the disk space of the cache directory (the calculation rule is (1 - <free-space-ratio>) * 100). After write cache is enabled, a certain percentage of disk space will be overused, with the calculation rule (1 - (<free-space-ratio> / 2)) * 100, i.e. by default up to 90% of the cache directory's disk space will be used.
- Write cache and read cache share the cache disk space, so they affect each other. For example, if the write cache takes up too much disk space, the size of the read cache will be limited, and vice versa.
- If the local disk write speed is lower than the object storage upload speed, enabling --writeback will only result in worse write performance.
- If the file system of the cache directory raises an error, the client falls back to writing synchronously to object storage, the same behavior as described in Read Cache in Client.
- If the object storage upload speed is too slow (low bandwidth), the local write cache can take forever to upload; meanwhile reads from other nodes will result in timeout errors (I/O error). See Connection problems with object storage.
- Data writes do not directly contribute to the cache group; access from clients on other nodes will have to download from object storage. Continue reading to learn about cache groups.
Improper use of client write cache can easily cause problems, which is why it is only recommended to enable it temporarily when writing a large number of small files (e.g. extracting a compressed file that contains many small files). Adjusting mount options is also very easy: the JuiceFS client supports seamless remount, just add the needed parameters and run the mount command again, see Seamless remount.
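As a sketch of this temporary-enable workflow (volume name and mount point are assumptions), you could remount with the client write cache before a small-file-heavy job and turn it off again afterwards:

```shell
# Remount with client write cache enabled before extracting many small files
juicefs mount myjfs /jfs --writeback

# ... run the small-file workload, e.g. tar -xf archive.tar -C /jfs/data ...

# Remount again without --writeback once the job is done
juicefs mount myjfs /jfs
```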
Distributed cache
By default, JuiceFS cache can only be accessed on the host where it is mounted, but when many JuiceFS clients need to repeatedly access the same dataset, you can enable the distributed cache feature and share local cache among all clients, which significantly improves performance. Distributed cache is best suited for deep learning model training on a GPU cluster: by caching the training dataset in memory or on SSDs across all nodes, JuiceFS provides high-performance data access so that GPUs do not sit idle waiting for data.
To enable distributed cache, set --cache-group to a desired group name when you mount JuiceFS. Clients within the same cache group listen on a random port and share cache with their LAN peers, using the metadata service for service discovery. You can additionally specify the --fill-group-cache option (disabled by default) so that data blocks uploaded to object storage are also sent directly to the cache group, making every write operation contribute to the cache.
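A sketch of mounting the members of a cache group; the group name, volume name, mount point, cache paths and sizes are all assumptions:

```shell
# Run this on every node that should join the shared cache group
juicefs mount myjfs /jfs \
  --cache-group=training-cache \
  --cache-dir=/ssd/jfscache \
  --cache-size=512000 \
  --fill-group-cache
```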
Member nodes within a cache group form a consistent hashing ring; the storage location of every cached data block is calculated using consistent hashing. When a client initiates a read request, the data block is obtained directly from the responsible member node (if the data is not yet cached, that node downloads it from object storage and stores it). When a node joins or leaves the cache group, data is migrated to neighboring nodes on the hashing ring (in the actual implementation, migration happens 10 minutes after the cache group membership changes, to avoid unnecessary fluctuation). Thus, membership changes in a cache group only affect a small portion of the cached data. JuiceFS also implements virtual nodes to ensure load balancing and prevent data hot spots caused by data migration.
Some important considerations when building a JuiceFS distributed cache, which also apply to building a dedicated cache cluster (introduced in the next section):
- Use a homogeneous set of nodes (at least the same disk size for cache storage) for the cache group, because the consistent hashing algorithm currently used by JuiceFS assumes all nodes have the same cache storage size. Expect wasted disk space if you use nodes with different disk sizes in a JuiceFS cache group.
- Cache group members should be connected with 10 Gigabit Ethernet; if multiple network interfaces are installed, use --group-ip to specify the network device with the larger bandwidth.
- Cache group members need high-speed access to the object storage service; a slow connection will result in a poorer experience.
Dedicated Cache Cluster
In a distributed cache group, all clients contribute cache, but if the cache group lives in a cluster that is constantly scaling up and down (like a Kubernetes cluster), cache utilization is not ideal because cache nodes are frequently destroyed and re-created. In this case, consider building a dedicated cache cluster so that the computation cluster has access to a more stable cache. Set up the dedicated cache cluster according to "Distributed cache", make sure --cache-group is the same between the cache cluster and the computation cluster, and then add --no-sharing to the computation cluster (see the mount sketch after the list below). Here is what you should know about --no-sharing:
- Nodes with --no-sharing will access cached data built by the cache group, but they do not serve cache to peers.
- --fill-group-cache behavior is not affected by --no-sharing. If a --no-sharing node is also mounted with --fill-group-cache, data blocks written on that node are also sent to the cache group, contributing to the cache.
- If --writeback is enabled, --fill-group-cache will cease to work.
- If you'd like to maximize cache utilization while saving disk space on client nodes, consider disabling local cache on the client nodes so that all data requests are served by the cache cluster / object storage: set --cache-size to 0 to disable local cache.
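The mount sketch referenced above, with an assumed group name, volume name, mount point and cache configuration; the cache cluster nodes keep large local caches, while the computation nodes attach with --no-sharing:

```shell
# On dedicated cache cluster nodes: large local cache, shared with the group
juicefs mount myjfs /jfs \
  --cache-group=training-cache \
  --cache-dir=/ssd/jfscache \
  --cache-size=1024000

# On computation nodes: join the same group but do not serve cache to peers;
# --cache-size=0 makes them rely entirely on the cache cluster
juicefs mount myjfs /jfs \
  --cache-group=training-cache \
  --no-sharing \
  --cache-size=0
```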
At this point, there will be three levels of cache: the system cache of the computing node, the disk cache of the computing node, and the cache cluster nodes (including their system cache). The cache media and cache size at each level can be configured according to the access characteristics of the specific application.
When you need to access a fixed dataset, you can use juicefs warmup to warm it up in advance and improve first-time read performance. You don't necessarily need to run juicefs warmup within the dedicated cache cluster: when executed on a client mounted with --no-sharing, the cache cluster will be warmed up all the same.
To build a dedicated cache cluster for Kubernetes cluster, refer to CSI Driver: Dedicated cache cluster.
Cache Optimization
Always pay close attention to cache-related metrics when doing optimizations. You can download our Grafana Dashboard, which already includes many crucial charts; refer to Monitoring for details.
For a single node, check these metrics in the dashboard:
- Block Cache Hit Ratio
- Block Cache Count
- Block Cache Size
- Other Block Cache related metrics in Grafana Dashboard
For cache clusters, check these metrics so you know when to scale up and down:
- When the cache cluster capacity is sufficient, most of the clients' requests should be served by the cache cluster, with only a small number of requests sent to object storage. Observe this using metrics like Block Cache Hit Ratio or Block Cache Size, both of which are already provided in our Grafana Dashboard.
- Insufficient cache group storage will cause block eviction; observe this using juicefs_blockcache_evict and other relevant metrics. See Metrics for more.