JuiceFS consists of three parts: the JuiceFS Client, the Data Storage (object storage), and the Metadata Engine.
All file I/O happens in the JuiceFS Client, including background jobs like data compaction and trash file expiration. Naturally, the JuiceFS Client talks to both the object storage and the metadata service. A variety of implementations are supported:
- FUSE mount point: JuiceFS file system can be mounted in a POSIX-compatible manner, allowing the elastic cloud storage to be used locally
- Hadoop Java SDK: JuiceFS can replace HDFS and provide massive storage for Hadoop at a significantly lower cost
- Kubernetes CSI Driver: use the JuiceFS CSI Driver in Kubernetes to provide shared storage for containers
- S3 Gateway: expose S3 API, so that applications using S3 as the storage layer can directly access JuiceFS file system
Client program distribution is achieved through a Python script (except for the Hadoop SDK), while the actual workload is handled by a separate Go binary. The installation command for our cloud service simply downloads this Python script:
curl -L https://juicefs.com/static/juicefs -o /usr/local/bin/juicefs && chmod +x /usr/local/bin/juicefs
# Preview the file to see that it is just a Python script
Run any sub-command and this Python script will automatically detect the environment, download the appropriate Go binary, and use that binary to carry out the actual workload:
# Go binary will be downloaded on the first run
$ juicefs version
JuiceFS version 5.0.2 (2023-11-07 48bfaf2)
# The default installation path for the Go binary
$ find /root/.juicefs -executable
After a JuiceFS file system is mounted (if you do not yet know how, read the getting started guide), you'll see two processes on the host:
# Mount the file system to /jfs
$ juicefs mount myjfs /jfs
$ ps -ef | grep juicefs
# The daemon process, which manages the actual mount process
root 6000 1 7 15:11 ? 00:00:00 /usr/local/bin/juicefs mount myjfs /jfs
# The mount process
root 959605 6000 12 15:11 ? 00:00:00 /root/.juicefs/jfsmount mount myjfs /jfs
The parent process watches over the mount process; if the latter crashes, the daemon restarts the mount point to ensure service continuity.
Although the parent process shows /usr/local/bin/juicefs, which looks like the initial Python script, the actual process is the Go binary: it rewrites its displayed command as a simple display optimization, so users can easily copy the process command and use it directly as the mount command.
# Verify the parent process (PID 6000) executable
$ ls -alh /proc/6000/exe
lrwxrwxrwx 1 root root 0 Nov 24 15:59 /proc/6000/exe -> /root/.juicefs/jfsmount
Graceful restart and upgrade
Thanks to the daemon architecture, the JuiceFS Client supports dynamically modifying mount options and then gracefully restarting the mount point by handing over file handles. To achieve this, just change the mount options and re-run the mount command:
# Verify current mount point
$ df -h /jfs
Filesystem Size Used Avail Use% Mounted on
JuiceFS:myjfs 1.0T 11G 1014G 1% /jfs
# Modify mount option and remount
# If you've forgotten the mount command, you can always get it back from the ps command demonstrated above
$ juicefs mount myjfs /jfs --buffer-size=300
OK, myjfs is ready at /jfs.
# Check runtime configuration to verify
$ grep -i buffer /jfs/.config
Apart from graceful restart, JuiceFS mount point can be gracefully upgraded using this one-liner:
juicefs version --upgrade --restart
- If you're still using version 4.9 or earlier, the daemon process is a Python program, which differs from the 5.0 series. For these older clients, Python 3 is needed for graceful restart & upgrade.
- Graceful restart doesn't support FUSE option changes, umount & remount is required.
- A graceful restart triggers persistence to object storage: pending data inside the buffer is uploaded to the object storage. Under high load, this might not finish in time, and the client log will contain an error message like:
mount point xxx is busy, stop upgrade.
File data is split into chunks and stored in object storage (more on this in How JuiceFS stores files). You can use object storage provided by public cloud services, or self-hosted solutions; JuiceFS supports virtually all types of object storage, including typical self-hosted ones like OpenStack Swift, Ceph, and MinIO.
A high-performance metadata engine, developed in-house by Juicedata. The Metadata Engine stores file metadata, which includes:
- Common file system metadata: file name, size, permission information, creation and modification time, directory structure, file attribute, symbolic link, file lock.
- JuiceFS specific metadata: file inode, chunk and slice mapping, client session, etc.
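As a rough mental model of the metadata listed above, the records could be sketched like this (the names and shapes here are hypothetical simplifications, not the actual storage format):

```python
from dataclasses import dataclass, field

@dataclass
class Slice:
    """One contiguous write inside a chunk (hypothetical simplified shape)."""
    id: int
    offset: int  # position within the owning chunk
    length: int

@dataclass
class Inode:
    """A file's metadata record (hypothetical simplified shape)."""
    inode: int
    size: int
    mode: int
    # chunk index -> slices written into that chunk, in write order
    chunks: dict[int, list[Slice]] = field(default_factory=dict)

# A sequentially written file: chunk 0 holds a single 64 MiB slice
f = Inode(inode=2, size=160 << 20, mode=0o644)
f.chunks[0] = [Slice(id=1, offset=0, length=64 << 20)]
```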
The metadata service is already deployed in most public cloud regions, so you can use it right out of the box. As a Cloud Service user, you will access the metadata service via the public internet (within the same region, for optimal performance), but if you are using JuiceFS at a larger scale and require even better access latency, contact the JuiceFS team to enable private network access via VPC peering.
The metadata service is a high-availability service built on top of the Raft protocol, where all metadata operations are appended as Raft logs. A Raft group typically consists of 3 nodes, one leader and two followers, all kept consistent through the Raft algorithm, achieving both strong consistency and high availability.
A Raft group forms a JuiceFS Metadata Zone. Currently a single zone can handle around 200 million inodes, for larger data scale, use our multi-zone solution (only available via on-premise deployment). Multi-zone clusters scale horizontally by simply adding more zones.
A JuiceFS file system backed by a multi-zone Metadata Service can handle tens of billions of files, with every zone assuming the same architecture as a single Raft group. Multiple zones can be deployed on a single node, and the number of zones can be adjusted dynamically (manually or balanced automatically), which effectively avoids performance issues caused by data hot spots. All relevant functionality is integrated into our Web Console for easy management and monitoring. For the JuiceFS Client, accessing a multi-zone Metadata Service is the same as the single-zone scenario, and it can write to different zones at the same time. When the cluster topology changes, clients are notified and adapt automatically.
How JuiceFS stores files
Traditional file systems use local disks to store both file data and metadata. However, JuiceFS formats data first and then stores it in the object storage, with the corresponding metadata being stored in the metadata engine.
In JuiceFS, each file is composed of one or more chunks. Each chunk has a maximum size of 64 MB. Regardless of the file's size, all reads and writes are located based on their offsets (the position in the file where the read or write operation occurs) to the corresponding chunk. This design enables JuiceFS to achieve excellent performance even with large files. As long as the total length of the file remains unchanged, the chunk division of the file remains fixed, regardless of how many modifications or writes the file undergoes.
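Because the chunk size is fixed, locating the chunk for any read or write is plain integer arithmetic. A toy sketch (not JuiceFS code) of this mapping:

```python
CHUNK_SIZE = 64 << 20  # 64 MiB, the fixed maximum chunk size

def locate(offset: int) -> tuple[int, int]:
    """Map a file offset to (chunk index, offset within that chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# A read at offset 100 MiB lands 36 MiB into chunk 1
print(locate(100 << 20))  # → (1, 37748736)
```

This constant-time lookup is why file size does not affect how reads and writes are positioned.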
Chunks exist to optimize lookup and positioning, while the actual file writing is performed on slices. In JuiceFS, each slice represents a single continuous write, belongs to a specific chunk, and cannot overlap between adjacent chunks. This ensures that the slice length never exceeds 64 MB.
For example, if a file is generated through a continuous sequential write, each chunk contains only one slice. The figure above illustrates this scenario: a 160 MB file is sequentially written, resulting in three chunks, each containing only one slice.
File writing generates slices, and invoking `flush` persists these slices. `flush` can be explicitly called by the user, and even if not invoked, the JuiceFS client automatically performs `flush` at the appropriate time to prevent buffer overflow (refer to Read/write buffer). When persisting to the object storage, slices are further split into individual blocks (default maximum size of 4 MB) to enable multi-threaded concurrent writes, thereby enhancing write performance. The previously mentioned chunks and slices are logical data structures, while blocks represent the final physical storage form and serve as the smallest storage unit for the object storage and disk cache.
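The slice-to-block split can be illustrated as follows (sizes only; this is a sketch, while the real client uploads these blocks concurrently):

```python
BLOCK_SIZE = 4 << 20  # default maximum block size, 4 MiB

def split_into_blocks(slice_len: int) -> list[int]:
    """Return the sizes of the blocks a slice is split into for upload."""
    blocks = []
    remaining = slice_len
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE, remaining))
        remaining -= blocks[-1]
    return blocks

# A 10 MiB slice becomes two full 4 MiB blocks plus a 2 MiB tail
print(split_into_blocks(10 << 20))  # → [4194304, 4194304, 2097152]
```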
After writing a file to JuiceFS, you cannot find the original file directly in the object storage. Instead, the storage bucket contains a `chunks` folder and a series of numbered directories and files. These numerically named object storage files are the blocks split and stored by JuiceFS. The mapping between these blocks, chunks, slices, and other metadata information (such as file names and sizes) is stored in the metadata engine. This decoupled design makes JuiceFS a high-performance file system.
Regarding logical data structures, if a file is generated not through continuous sequential writes but through multiple append writes, each append write triggers a `flush` to initiate the upload, resulting in multiple slices. If the data size of each append write is less than 4 MB, the blocks eventually stored in the object storage are also smaller than 4 MB.
Depending on the writing pattern, the arrangement of slices can be diverse:
- If a file is repeatedly modified in the same part, it results in multiple overlapping slices.
- If writes occur in non-overlapping parts, there will be gaps between slices.
However complex the arrangement of slices may be, when reading a file, the most recently written slice is read for each file position. The figure below illustrates this concept: while slices may overlap, reading the file always occurs "from top to bottom." This ensures that you see the latest state of the file.
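The "from top to bottom" rule can be demonstrated with a toy byte-level resolver (purely illustrative; the real implementation works on interval metadata, not per-byte scans):

```python
def read_range(slices, offset, size):
    """Resolve a read: for each byte, the most recently written slice wins.

    `slices` is a list of (slice_id, start, length) in write order.
    Returns the owning slice id for each byte in the requested range.
    """
    result = []
    for pos in range(offset, offset + size):
        owner = None
        for sid, start, length in slices:  # later writes override earlier ones
            if start <= pos < start + length:
                owner = sid
        result.append(owner)
    return result

# Slice 2 overwrites the middle of slice 1
slices = [(1, 0, 8), (2, 3, 3)]
print(read_range(slices, 0, 8))  # → [1, 1, 1, 2, 2, 2, 1, 1]
```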
Due to the potential overlapping of slices, JuiceFS marks the valid data offset range for each slice in the reference relationship between chunks and slices (read the Community Edition docs for the internal implementation; Community Edition and Enterprise Edition adopt the same design in this regard). This approach informs the file system of the valid data in each slice.
However, it is not difficult to imagine that looking up the "most recently written slice within the current read range" during file reading, especially with a large number of overlapping slices as shown in the figure, can significantly impact read performance. This leads to what we call "file fragmentation." File fragmentation not only affects read performance but also increases space usage at various levels (object storage, metadata). Hence, whenever a write occurs, the metadata service evaluates the file's fragmentation and schedules compaction as a background job, merging all slices within the same chunk into one.
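Conceptually, compaction flattens all slices of a chunk into a single covering slice, so later reads no longer search overlapping intervals. A toy sketch (the real background job also rewrites the underlying blocks in object storage, which this omits):

```python
def compact(slices):
    """Merge the slices of one chunk into a single covering slice.

    Each slice is (slice_id, start, length). The merged slice spans all
    written data; any gaps are assumed to read as zeros and are carried along.
    """
    start = min(s for _, s, _ in slices)
    end = max(s + l for _, s, l in slices)
    return ("merged", start, end - start)

# Three fragmented slices collapse into one slice covering bytes 0..14
print(compact([(1, 0, 8), (2, 3, 3), (3, 10, 4)]))  # → ('merged', 0, 14)
```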
Some other technical aspects of JuiceFS storage design:
- To avoid read amplification, files are not merged and stored together, with one exception under intensive small-file scenarios: when many small files are created, they are merged and stored in batches to increase write performance. They will be split apart during later compaction.
- Provides strong consistency guarantees, but can be tuned for performance in different scenarios, e.g. deliberately relaxing metadata cache policies to trade consistency for performance. Learn more at "Metadata cache".
- Supports "Trash" functionality, enabled by default: deleted files are kept for a specified amount of time to help you avoid data loss caused by accidental deletion.