Cache

To improve performance, JuiceFS supports caching at multiple levels to reduce latency and increase throughput, including metadata cache, data cache, and cache sharing among a group of clients.

Metadata Cache

JuiceFS caches metadata in the kernel and in client memory to improve performance.

Metadata Cache in Kernel

Three kinds of metadata can be cached in the kernel: attribute, entry (file), and direntry (directory). Their timeouts are configurable through the following parameters:

--attrcacheto=ATTRCACHETO
attribute cache timeout, default 1s
--entrycacheto=ENTRYCACHETO
file entry cache timeout, default 1s
--direntrycacheto=DIRENTRYCACHETO
directory entry cache timeout, default 1s

Attributes, entries and direntries are cached for 1 second by default, to speed up lookup and getattr operations.
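
For example, a read-heavy workload that can tolerate slightly stale metadata could lengthen these timeouts at mount time. A minimal sketch, assuming a volume named myjfs mounted at /jfs and timeout values given in seconds:

    # extend kernel metadata cache timeouts from 1s to 3s (values are assumptions)
    juicefs mount myjfs /jfs \
      --attrcacheto=3 \
      --entrycacheto=3 \
      --direntrycacheto=3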

Metadata Cache in Client Memory

When the client needs to list and query the metadata server frequently, it can cache all the dentries of visited directories in memory. This is controlled by the following parameter (enabled by default):

--metacache         cache metadata in client memory

When enabled, visited directories are cached in client memory for 5 minutes (adjustable via the --metacacheto option). The cache is invalidated upon any change to the directory to guarantee consistency. This metadata cache improves the performance of operations such as lookup, getattr, access, and open. By default, the JuiceFS client caches metadata for up to 500,000 directories, which can be adjusted with the --max-cached-inodes option.
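
For instance, a client that repeatedly lists large directories could keep this cache longer and raise the inode limit. A hedged sketch, assuming a volume named myjfs mounted at /jfs; the chosen values, and the assumption that --metacacheto takes seconds, should be checked against juicefs mount --help:

    # cache visited directories for 10 minutes and allow up to 1,000,000 cached inodes
    juicefs mount myjfs /jfs \
      --metacache \
      --metacacheto=600 \
      --max-cached-inodes=1000000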

In addition, the client caches symbolic links. Since symbolic links are immutable (modifying one actually creates a new symlink that overwrites the existing one), the cached content never expires.

To maintain strong data consistency, the open system call fetches metadata directly from the metadata server, bypassing the local metadata cache. For read-only scenarios (e.g. model training), use --opencache to enable caching for open and further improve read performance. This feature is disabled by default.
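
A minimal sketch for a read-only workload such as model training, assuming a volume named myjfs mounted at /jfs:

    # read-only dataset: also cache metadata for open()
    juicefs mount myjfs /jfs --opencache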

Metadata Consistency

If only one client is connected, cached metadata is invalidated automatically upon modification, so consistency is not affected.

When a file is accessed by multiple clients, the only way to invalidate the metadata cache in the kernel is to wait for the timeout. The metadata cache in client memory is invalidated automatically on modification, but this is an asynchronous operation, so in extreme conditions a modification made on client A may not be visible on client B for a short time window.

Data Cache

To improve performance, JuiceFS also provides various caching mechanisms for data, including the page cache in the kernel and the local cache on the client host.

Data Cache in Kernel

The kernel automatically caches the content of recently accessed files. When a file is reopened, its content can be served directly from the kernel cache for the best performance.

The JuiceFS metadata server tracks a list of recently opened files. If a file is opened repeatedly by the same client, the metadata server checks its modification history and tells the client whether the kernel cache is still valid, so the client always reads the latest data.

Reading the same file in JuiceFS repeatedly is extremely fast, with millisecond latency and gigabytes of throughput.
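
An informal way to observe this is to read the same file twice with a standard tool; the file path below is hypothetical:

    # first read: served from object storage or the local disk cache
    dd if=/jfs/dataset/part-00000 of=/dev/null bs=1M
    # second read shortly after: served from the kernel page cache, typically much faster
    dd if=/jfs/dataset/part-00000 of=/dev/null bs=1M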

If the kernel write cache is not enabled in the current JuiceFS client, all write operations from the application are passed from FUSE to the client directly. Starting from Linux kernel 3.15, FUSE supports "writeback-cache mode", which allows the write() syscall to complete very quickly. You can enable writeback-cache mode with the -o writeback_cache option when running juicefs mount. This mode is recommended for workloads with intensive small writes (e.g. around 100 bytes).
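
A minimal sketch of enabling writeback-cache mode, assuming a volume named myjfs mounted at /jfs on a host running Linux kernel 3.15 or later:

    # pass the FUSE writeback_cache option through juicefs mount
    juicefs mount myjfs /jfs -o writeback_cache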

Read Cache in Client

The client automatically performs prefetching and caching based on the application's read pattern to improve sequential read performance. Data is cached on the local file system, which can be backed by any local storage device such as HDD, SSD, or even memory.

If the cache file system fails and returns an error immediately, JuiceFS automatically falls back to accessing object storage directly. This is the usual behavior of local file systems, but for others (such as some kernel-mode network file systems), a failure may manifest as hanging reads, causing JuiceFS reads to hang as well. If this happens, adjust the settings so that the cache file system fails fast.

Data downloaded from object storage, as well as data uploaded to it, is cached by JuiceFS without compression or encryption.

Below are some crucial parameters for cache configuration (see juicefs mount for a complete reference); a combined example follows the list:

  • --cache-dir

    Cache directory, defaults to /var/jfsCache. Multiple values separated by : are supported, which is mostly used on dedicated cache nodes with multiple storage devices mounted. /dev/shm is also supported if you'd like to use memory for cache.

    If you urgently need to free up disk space, you can manually delete data under the cache directory, which is usually /var/jfsCache/<vol-name>/raw/.

    caution

    If "Write Cache in Client" is enabled, be sure not to delete data under /var/jfsCache/<vol-name>/rawstaging/, or you will lose data.

  • --cache-size and --free-space-ratio

    Cache size (default 100 GiB) and minimum ratio of free space (default 0.2). Both parameters control the cache size; if either limit is reached, the JuiceFS Client reduces cache usage in an LRU manner.

    The actual cache size may exceed the configured value, because it is difficult to calculate the exact disk space taken by the cache. Currently, JuiceFS uses the sum of all cached object sizes (plus a 4 KiB overhead for each object) as an estimate, which often differs from the result of du.

  • --cache-partial-only

    Only cache small files and random small reads; do not cache whole blocks. This applies when object storage throughput is higher than that of the local cache device.

    There are two main read patterns: sequential read and random read. Sequential reads usually demand higher throughput, while random reads need lower latency. When local disk throughput is lower than object storage throughput, consider enabling --cache-partial-only so that sequential reads do not cache whole blocks; only small reads (such as Parquet / ORC footers) are cached. This allows JuiceFS to take advantage of the low latency of local disks and the high throughput of object storage at the same time.
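
As a combined sketch, a mount command tuning these options might look like the following; the volume name, mount point, cache paths and values are assumptions rather than recommendations, and the --cache-size unit is assumed to be MiB:

    # two local disks for cache (about 200 GiB total), keep 10% free space,
    # and only cache small/random reads
    juicefs mount myjfs /jfs \
      --cache-dir=/data1/jfsCache:/data2/jfsCache \
      --cache-size=204800 \
      --free-space-ratio=0.1 \
      --cache-partial-only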

Data cache significantly improves random read performance. Using a faster storage device for cache is recommended for applications with intensive random reads, such as MySQL, Elasticsearch, and ClickHouse.

Another feature that impacts random read performance is compression (the same applies to encryption). When compression is enabled, a whole block is always fetched from object storage and cached, no matter how small the read actually is. This hurts first-time read performance, so it's recommended to disable compression for JuiceFS when the system does intensive random reads. Without compression, JuiceFS can perform partial reads on a block, which improves read latency and reduces bandwidth consumption.

Data Consistency

JuiceFS generates a unique key for all data written to object storage, and all objects are immutable, so cached data never expires. When the cache grows over the size limit (or the disk is full), it is cleaned up automatically. The current policy is based on creation time and file size: older and larger files are cleaned first.

Write Cache in Client

The client caches data written by the application in memory. The data is flushed to object storage when a chunk is filled up, or when the application forces it with close()/fsync(); if neither happens, the data is still uploaded after a while. When an application calls fsync() or close(), the client does not return until the data has been uploaded to object storage and the metadata server has been notified, ensuring data integrity. Asynchronous uploading (--writeback) may help improve performance if local storage is reliable. In this case, close() is not blocked while data is being uploaded to object storage; instead, it returns as soon as the data is written to the local cache directory.

So when there are intensive small writes, it is recommended to enable --writeback to improve performance. But this comes with some caveats:

  • With --writeback, write reliability depends strongly on the reliability of the cache. Use with caution when data reliability is critical.
  • With --writeback, written data does not directly contribute to the cache group; clients on other nodes will have to download it from object storage. Continue reading to learn about cache groups.
  • If the cache file system raises an error, the client falls back and writes synchronously to object storage, the same behavior as described in Read Cache in Client.

For the above reasons, --writeback is disabled by default.
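
A hedged sketch of enabling it on a node with reliable local storage, assuming a volume named myjfs mounted at /jfs:

    # asynchronous uploads: close() returns once data reaches the local cache directory
    juicefs mount myjfs /jfs --writeback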

Group Cache Sharing

By default, JuiceFS cache is shared at the host level, i.e. processes on a single host have access to the same cache, and every host maintains its own cache.

When clients in the same cluster need to access the same dataset repeatedly, e.g. continuous training on the same dataset in machine learning, JuiceFS provides cache sharing to improve performance. It can be enabled with --cache-group.
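
For example, every node that should share cache can be mounted with the same group name; the volume name, mount point and group name below are arbitrary:

    # run the same command on each node of the cluster
    juicefs mount myjfs /jfs --cache-group=training-cluster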

When a cache group is enabled, the JuiceFS Client listens on a random port and communicates with its LAN peers, using the metadata server for service discovery.

When a client needs to read a specific block, it tries to retrieve it from the peer node that has already cached it; if it is not available, the client reads directly from object storage and saves the block to its own cache. A consistent hashing ring is built among cache group nodes, similar to Memcached, so there is little impact when clients join or leave the group.

Group cache is best suited for deep learning on GPU clusters. By caching the training dataset in the memory of all nodes, read performance becomes high enough that it is no longer the bottleneck for the GPUs.

Independent Cache Cluster

When a cache group is created within a cluster that constantly scales up and down (like Spark on Kubernetes), cache utilization is not ideal because cache nodes are frequently destroyed and re-created. In this case, consider building a dedicated cache cluster so that the computation cluster has access to a more stable cache. Set up the dedicated cache cluster according to Group Cache Sharing, making sure --cache-group is the same for the cache cluster and the computation cluster. In addition, add --no-sharing to the computation cluster; this prevents the computation cluster from joining the same cache group as a cache provider (it only asks for cache, never gives).
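
Putting this together, a dedicated cache cluster setup might look like the sketch below; the volume name, mount point, group name and cache size are assumptions:

    # on each dedicated cache node: large local cache, shared within the group
    juicefs mount myjfs /jfs --cache-group=dedicated-cache --cache-size=1024000

    # on each computation node: use the same group, but only consume cache
    juicefs mount myjfs /jfs --cache-group=dedicated-cache --no-sharing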

At this point, there will be three levels of cache: the system cache of the computing node, the disk cache of the computing node, and the cache cluster nodes (including their system cache). The cache media and cache size at each level can be configured according to the access characteristics of the specific application.

When you need to access a fixed dataset, you can use juicefs warmup to warm it up in advance and improve first-time read performance. You don't necessarily need to run juicefs warmup within the dedicated cache cluster; when executed on a client with --no-sharing enabled, the cache cluster is warmed up all the same.
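
A hedged example of warming up a dataset directory from any client mount; the path is hypothetical:

    # build cache for everything under the dataset directory
    juicefs warmup /jfs/dataset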

A dedicated cache cluster should be properly monitored, so you know when to scale up and down:

  • When the cache cluster capacity is sufficient, most of the clients' requests should be served by the cache cluster, and only a small number of requests should go to object storage. Observe this using metrics like Block Cache Hit Ratio or Block Cache Size, both of which are already provided in the Grafana dashboard; refer to Monitoring for details.
  • Insufficient cache group storage will cause block eviction; observe this using blockcache.evict and other relevant metrics. See Metrics for more.