JuiceFS consists of three parts: the JuiceFS Client, the Data Storage (object storage), and the Metadata Engine.
All file I/O happens in the JuiceFS Client, including background jobs like data compaction and trash file expiration. Naturally, the JuiceFS Client talks to both the object storage and the metadata service. A variety of implementations are supported:
- FUSE mount point: JuiceFS file system can be mounted in a POSIX-compatible manner, allowing the elastic cloud storage to be used locally
- Hadoop Java SDK: JuiceFS can replace HDFS and provide massive storage for Hadoop at a significantly lower cost
- Kubernetes CSI Driver: use the JuiceFS CSI Driver in Kubernetes to provide shared storage for containers
- S3 Gateway: expose S3 API, so that applications using S3 as the storage layer can directly access JuiceFS file system
Client program distribution is achieved through a Python script (except for the Hadoop SDK), while the actual workload is handled by a separate Go binary. The installation command for our cloud service simply downloads this Python script:
curl -L https://juicefs.com/static/juicefs -o /usr/local/bin/juicefs && chmod +x /usr/local/bin/juicefs
# Preview the file to see that it is just a Python script
Run any sub-command and this Python script will automatically detect the environment, download the appropriate Go binary, and use that binary to carry out the actual workload:
# Go binary will be downloaded on the first run
$ juicefs version
JuiceFS version 5.0.2 (2023-11-07 48bfaf2)
# The default installation path for the Go binary
$ find /root/.juicefs -executable
After a JuiceFS file system is mounted (if you do not yet know how, read the getting started guide), you'll see two processes on the host:
# Mount the file system to /jfs
$ juicefs mount myjfs /jfs
$ ps -ef | grep juicefs
# The daemon process, which manages the actual mount process
root 6000 1 7 15:11 ? 00:00:00 /usr/local/bin/juicefs mount myjfs /jfs
# The mount process
root 959605 6000 12 15:11 ? 00:00:00 /root/.juicefs/jfsmount mount myjfs /jfs
The parent process watches over the mount process; if the latter crashes, the daemon restarts the mount point to ensure service continuity.
Although the parent process shows /usr/local/bin/juicefs, which looks like the initial Python script, the actual process is the Go binary: it rewrites its displayed command as a simple display optimization, so users can easily copy the process command and use it directly as the mount command.
# Verify the parent process (PID 6000) executable
$ ls -alh /proc/6000/exe
lrwxrwxrwx 1 root root 0 Nov 24 15:59 /proc/6000/exe -> /root/.juicefs/jfsmount
Graceful restart and upgrade
Thanks to the daemon architecture, the JuiceFS Client supports dynamically modifying mount options and then gracefully restarting the mount point by handing over file handles. To achieve this, just change the mount options and re-run the mount command:
# Verify current mount point
$ df -h /jfs
Filesystem Size Used Avail Use% Mounted on
JuiceFS:myjfs 1.0T 11G 1014G 1% /jfs
# Modify mount option and remount
# If you've forgotten the mount command, you can always get it back from the ps command demonstrated above
$ juicefs mount myjfs /jfs --buffer-size=300
OK, myjfs is ready at /jfs.
# Check runtime configuration to verify
$ grep -i buffer /jfs/.config
Apart from graceful restart, JuiceFS mount point can be gracefully upgraded using this one-liner:
juicefs version --upgrade --restart
- If you're still using version 4.9 or earlier, the daemon process is a Python program, which differs from the 5.0 series. For these older clients, Python 3 is needed for graceful restart & upgrade.
- Graceful restart doesn't support FUSE option changes, umount & remount is required.
- A graceful restart triggers persistence to object storage: pending data inside the buffer is uploaded to the object storage. Under high load, this might not finish in time, and the client log will contain an error message like:
mount point xxx is busy, stop upgrade.
File data is split into chunks and stored in object storage (more on this in How JuiceFS stores files). You can use object storage provided by public cloud services, or self-hosted solutions; JuiceFS supports virtually all types of object storage, including typical self-hosted ones like OpenStack Swift, Ceph, and MinIO.
A high-performance metadata engine, developed in-house by Juicedata. The Metadata Engine stores file metadata, which includes:
- Common file system metadata: file name, size, permission information, creation and modification time, directory structure, file attribute, symbolic link, file lock.
- JuiceFS specific metadata: file inode, chunk and slice mapping, client session, etc.
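As a rough mental model of the metadata listed above, the records could be sketched like this (the names and shapes here are hypothetical simplifications, not the actual storage format):

```python
from dataclasses import dataclass, field

@dataclass
class Slice:
    """One contiguous write inside a chunk (hypothetical simplified shape)."""
    id: int
    offset: int  # position within the owning chunk
    length: int

@dataclass
class Inode:
    """A file's metadata record (hypothetical simplified shape)."""
    inode: int
    size: int
    mode: int
    # chunk index -> slices written into that chunk, in write order
    chunks: dict[int, list[Slice]] = field(default_factory=dict)

# A sequentially written file: chunk 0 holds a single 64 MiB slice
f = Inode(inode=2, size=160 << 20, mode=0o644)
f.chunks[0] = [Slice(id=1, offset=0, length=64 << 20)]
```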
The metadata service is already deployed in most public cloud regions, so you can use it right out of the box. As a Cloud Service user, you will access the metadata service via the public internet (within the same region, for optimal performance), but if you are using JuiceFS at a larger scale and require even better access latency, contact the JuiceFS team to enable private network access via VPC peering.
The metadata service is a high-availability service built on top of the Raft protocol, where all metadata operations are appended as Raft logs. A Raft group typically consists of 3 nodes, one leader and two followers, all kept consistent through the Raft algorithm, achieving both strong consistency and high availability.
A Raft group forms a JuiceFS Metadata Zone. Currently a single zone can handle around 200 million inodes, for larger data scale, use our multi-zone solution (only available via on-premise deployment). Multi-zone clusters scale horizontally by simply adding more zones.
A JuiceFS file system backed by a multi-zone Metadata Service can handle tens of billions of files, with every zone assuming the same architecture as a single Raft group. Multiple zones can be deployed on a single node, and the number of zones can be adjusted dynamically (manually or balanced automatically), which effectively avoids performance issues caused by data hot spots. All relevant functionality is integrated into our Web Console for easy management and monitoring. For the JuiceFS Client, accessing a multi-zone Metadata Service is the same as the single-zone scenario, and it can write to different zones at the same time. When the cluster topology changes, clients are notified and adapt automatically.
How JuiceFS stores files
Traditional file systems use local disks to store both file data and metadata. However, JuiceFS formats data first and then stores it in the object storage, with the corresponding metadata being stored in the metadata engine.
In JuiceFS, each file is composed of one or more chunks. Each chunk has a maximum size of 64 MB. Regardless of the file's size, all reads and writes are located based on their offsets (the position in the file where the read or write operation occurs) to the corresponding chunk. This design enables JuiceFS to achieve excellent performance even with large files. As long as the total length of the file remains unchanged, the chunk division of the file remains fixed, regardless of how many modifications or writes the file undergoes.
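Because the chunk size is fixed, locating the chunk for any read or write is plain integer arithmetic. A toy sketch (not JuiceFS code) of this mapping:

```python
CHUNK_SIZE = 64 << 20  # 64 MiB, the fixed maximum chunk size

def locate(offset: int) -> tuple[int, int]:
    """Map a file offset to (chunk index, offset within that chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# A read at offset 100 MiB lands 36 MiB into chunk 1
print(locate(100 << 20))  # → (1, 37748736)
```

This constant-time lookup is why file size does not affect how reads and writes are positioned.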
Chunks exist to optimize lookup and positioning, while the actual file writing is performed on slices. In JuiceFS, each slice represents a single continuous write, belongs to a specific chunk, and cannot overlap between adjacent chunks. This ensures that the slice length never exceeds 64 MB.
For example, if a file is generated through a continuous sequential write, each chunk contains only one slice. The figure above illustrates this scenario: a 160 MB file is sequentially written, resulting in three chunks, each containing only one slice.
File writing generates slices, and invoking `flush` persists these slices. `flush` can be explicitly called by the user, and even if not invoked, the JuiceFS client automatically performs `flush` at the appropriate time to prevent buffer overflow (refer to Read/write buffer). When persisting to the object storage, slices are further split into individual blocks (default maximum size of 4 MB) to enable multi-threaded concurrent writes, thereby enhancing write performance. The previously mentioned chunks and slices are logical data structures, while blocks represent the final physical storage form and serve as the smallest storage unit for the object storage and disk cache.
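The slice-to-block split can be illustrated as follows (sizes only; this is a sketch, while the real client uploads these blocks concurrently):

```python
BLOCK_SIZE = 4 << 20  # default maximum block size, 4 MiB

def split_into_blocks(slice_len: int) -> list[int]:
    """Return the sizes of the blocks a slice is split into for upload."""
    blocks = []
    remaining = slice_len
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE, remaining))
        remaining -= blocks[-1]
    return blocks

# A 10 MiB slice becomes two full 4 MiB blocks plus a 2 MiB tail
print(split_into_blocks(10 << 20))  # → [4194304, 4194304, 2097152]
```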
After writing a file to JuiceFS, you cannot find the original file directly in the object storage. Instead, the storage bucket contains a `chunks` folder and a series of numbered directories and files. These numerically named object storage files are the blocks split and stored by JuiceFS. The mapping between these blocks, chunks, slices, and other metadata information (such as file names and sizes) is stored in the metadata engine. This decoupled design makes JuiceFS a high-performance file system.
Regarding logical data structures, if a file is generated not through continuous sequential writes but through multiple append writes, each append write triggers a `flush` to initiate the upload, resulting in multiple slices. If the data size of each append write is less than 4 MB, the blocks eventually stored in the object storage are also smaller than 4 MB.
Depending on the writing pattern, the arrangement of slices can be diverse:
- If a file is repeatedly modified in the same part, it results in multiple overlapping slices.
- If writes occur in non-overlapping parts, there will be gaps between slices.
However complex the arrangement of slices may be, when reading a file, the most recently written slice is read for each file position. The figure below illustrates this concept: while slices may overlap, reading the file always occurs "from top to bottom." This ensures that you see the latest state of the file.
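The "from top to bottom" rule can be demonstrated with a toy byte-level resolver (purely illustrative; the real implementation works on interval metadata, not per-byte scans):

```python
def read_range(slices, offset, size):
    """Resolve a read: for each byte, the most recently written slice wins.

    `slices` is a list of (slice_id, start, length) in write order.
    Returns the owning slice id for each byte in the requested range.
    """
    result = []
    for pos in range(offset, offset + size):
        owner = None
        for sid, start, length in slices:  # later writes override earlier ones
            if start <= pos < start + length:
                owner = sid
        result.append(owner)
    return result

# Slice 2 overwrites the middle of slice 1
slices = [(1, 0, 8), (2, 3, 3)]
print(read_range(slices, 0, 8))  # → [1, 1, 1, 2, 2, 2, 1, 1]
```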
Due to the potential overlapping of slices, JuiceFS marks the valid data offset range for each slice in the reference relationship between chunks and slices (read the Community Edition docs for the internal implementation; Community Edition and Enterprise Edition adopt the same design in this regard). This approach informs the file system of the valid data in each slice.
However, it is not difficult to imagine that looking up the "most recently written slice within the current read range" during file reading, especially with a large number of overlapping slices as shown in the figure, can significantly impact read performance. This leads to what we call "file fragmentation." File fragmentation not only affects read performance but also increases space usage at various levels (object storage, metadata). Hence, whenever a write occurs, the metadata service evaluates the file's fragmentation and schedules compaction as a background job, merging all slices within the same chunk into one.
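Conceptually, compaction flattens all slices of a chunk into a single covering slice, so later reads no longer search overlapping intervals. A toy sketch (the real background job also rewrites the underlying blocks in object storage, which this omits):

```python
def compact(slices):
    """Merge the slices of one chunk into a single covering slice.

    Each slice is (slice_id, start, length). The merged slice spans all
    written data; any gaps are assumed to read as zeros and are carried along.
    """
    start = min(s for _, s, _ in slices)
    end = max(s + l for _, s, l in slices)
    return ("merged", start, end - start)

# Three fragmented slices collapse into one slice covering bytes 0..14
print(compact([(1, 0, 8), (2, 3, 3), (3, 10, 4)]))  # → ('merged', 0, 14)
```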
Some other technical aspects of JuiceFS storage design:
- To avoid read amplification, files are not merged and stored together, with one exception under intensive small-file scenarios: when many small files are created, they are merged and stored in batches to increase write performance. They will be split apart during later compaction.
- Provides strong consistency guarantees, but can be tuned for performance in different scenarios, e.g. deliberately relaxing metadata cache policies to trade consistency for performance. Learn more at "Metadata cache".
- Supports "Trash" functionality, enabled by default: deleted files are kept for a specified amount of time to help you avoid data loss caused by accidental deletion.