Code-Level Analysis: Design Principles of JuiceFS Metadata and Data Storage

2024-12-12
Arthur

To improve performance, JuiceFS employs a chunking strategy in its data storage process. Concepts like chunk, slice, and block, along with their working mechanisms, may not be easy for new users to grasp. This article, shared by JuiceFS community user Arthur, provides a deep dive into JuiceFS' design principles with code examples, covering both data and metadata storage.

Files written to object storage by JuiceFS

Using MinIO as an example, let’s examine how JuiceFS organizes files in object storage. This structure is similar across other cloud providers like AWS S3. The bucket structure below provides a clear view of how JuiceFS organizes a volume's data in object storage.

In the bucket: One directory per volume

After creating two volumes with the juicefs format command, we can observe how they are organized within the bucket.

A JuiceFS bucket

In the bucket, the top-level "directories" represent JuiceFS volumes. The term "directory" is used here in quotes since object storage is a flat key-value storage system without a true directory structure. To make navigation and understanding easier, object storage simulates directories by grouping keys with common prefixes. For simplicity, the term "directory" will be used without quotes in the following sections.

Each volume’s directory structure: {chunks/, juicefs_uuid, meta/, ...}

Within each volume directory, the structure appears as follows:

  |-chunks/         # Data directory where all user data for the volume resides
  |-juicefs_uuid
  |-meta/           # Directory for metadata backups created by `juicefs mount --backup-meta`

juicefs_uuid: Unique volume identifier

The juicefs_uuid file contains the universally unique identifier (UUID) displayed during the juicefs format process, serving as the volume's unique identifier. This UUID is required for volume deletion.

meta/: JuiceFS metadata backup

If the --backup-meta option is specified during the execution of the juicefs mount command, JuiceFS periodically backs up metadata (stored in TiKV) to this directory. The backups serve purposes such as:

  • Restoring metadata after a metadata engine failure
  • Migrating metadata across different engines

chunks/

The figure below shows files of a bucket in the MinIO bucket browser:

MinIO bucket browser: files in a bucket

The directory structure in chunks/ looks like this:

  |-chunks/
  |   |-0/                # <-- id1 = slice_id / 1000 / 1000
  |   |  |-0/             # <-- id2 = slice_id / 1000
  |   |     |-1_0_16      # <-- {slice_id}_{block_id}_{size_of_this_block}
  |   |     |-3_0_4194304 #
  |   |     |-3_1_1048576 #
  |   |     |-...
  |-juicefs_uuid    
  |-meta/

All files in the bucket are named and stored using numbers, organized into three levels.

  • The first level: Purely numeric, derived by dividing slice_id by 1,000,000 (integer division).
  • The second level: Purely numeric, derived by dividing slice_id by 1,000 (integer division).
  • The third level: Numbers joined by underscores in the format {slice_id}_{block_id}_{size_of_this_block}, giving the slice ID, the index of the block within that slice, and the block's size in bytes. Don’t worry if the concepts of chunk, slice, and block are unclear; we'll explain them shortly.
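To make the naming scheme concrete, here is a minimal Go sketch that parses an object name back into its three fields and derives the two directory levels. The type and function names (blockName, parseBlockName, dirLevels) are hypothetical and are not taken from the JuiceFS code base.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// blockName holds the three fields encoded in an object name such as "3_1_1048576".
// The type and its field names are illustrative only.
type blockName struct {
	SliceID  uint64
	BlockIdx int
	Size     int
}

// parseBlockName splits "{slice_id}_{block_id}_{size_of_this_block}" into its parts.
func parseBlockName(name string) (blockName, error) {
	parts := strings.Split(name, "_")
	if len(parts) != 3 {
		return blockName{}, fmt.Errorf("unexpected object name %q", name)
	}
	id, err := strconv.ParseUint(parts[0], 10, 64)
	if err != nil {
		return blockName{}, err
	}
	idx, err := strconv.Atoi(parts[1])
	if err != nil {
		return blockName{}, err
	}
	size, err := strconv.Atoi(parts[2])
	if err != nil {
		return blockName{}, err
	}
	return blockName{SliceID: id, BlockIdx: idx, Size: size}, nil
}

// dirLevels returns the first- and second-level directory names for a slice ID,
// using integer division as in the tree above.
func dirLevels(sliceID uint64) (id1, id2 uint64) {
	return sliceID / 1000 / 1000, sliceID / 1000
}

func main() {
	b, err := parseBlockName("3_1_1048576")
	if err != nil {
		panic(err)
	}
	id1, id2 := dirLevels(b.SliceID)
	fmt.Printf("chunks/%d/%d/ slice=%d block=%d size=%d\n", id1, id2, b.SliceID, b.BlockIdx, b.Size)
}
```

With these rules, a slice with ID 1234567 would land under chunks/1/1234/, while the small IDs in the listing above all fall under chunks/0/0/.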

JuiceFS data design

Top-level division: Files are split into chunks

JuiceFS divides every file into chunks at fixed 64 MB boundaries, so a chunk holds at most 64 MB of data. Fixed boundaries make it simple to locate file contents during read or modification operations. Regardless of file size, every file is split into chunks; only the number of chunks varies.
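Because chunk boundaries are fixed, locating data is plain integer arithmetic. The following Go sketch shows the idea; the helper names chunkCount and locate are hypothetical, and only the 64 MB chunk size comes from the text.

```go
package main

import "fmt"

// chunkSize is the 64 MB chunk size described in the text.
const chunkSize = 64 << 20

// chunkCount returns how many chunks a file of the given length spans.
func chunkCount(length uint64) uint64 {
	return (length + chunkSize - 1) / chunkSize
}

// locate maps a file offset to a chunk index and an offset inside that chunk.
func locate(fileOffset uint64) (chunkIndex, offsetInChunk uint64) {
	return fileOffset / chunkSize, fileOffset % chunkSize
}

func main() {
	fmt.Println(chunkCount(129 << 20)) // number of chunks for a 129 MB file
	idx, off := locate(100 << 20)      // where byte offset 100 MB lives
	fmt.Println(idx, off>>20)
}
```

A 129 MB file spans three chunks (64 MB + 64 MB + 1 MB), and byte offset 100 MB falls 36 MB into the chunk with index 1.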

The figure below shows that each file in JuiceFS is divided into chunks, with a maximum chunk size of 64 MB:

JuiceFS splits each file into their respective chunks

Object storage: No chunk entity

Recall the directory structure in object storage from the previous section:

  |-chunks/
  |   |-0/                # <-- id1 = slice_id / 1000 / 1000
  |   |  |-0/             # <-- id2 = slice_id / 1000
  |   |     |-1_0_16      # <-- {slice_id}_{block_id}_{size_of_this_block}
  |   |     |-3_0_4194304 #
  |   |     |-3_1_1048576 #
  |   |     |-...
  |-juicefs_uuid    
  |-meta/

Chunks in object storage do not correspond to actual files. This means there are no 64 MB chunks stored individually in object storage. In JuiceFS terms, a chunk is a logical concept. If this concept isn't clear yet, don't worry—we'll explain it next.

A continuous write operation within a chunk: Slice

A chunk is merely a "container." Within this container, the data read and written for a file is represented by what JuiceFS calls a "slice." A continuous write within a chunk creates a slice, corresponding to the data written in that segment. Since a slice is a concept within a chunk, it cannot span across chunk boundaries, and its length does not exceed the maximum chunk size of 64 MB. The slice ID is globally unique.

Slice overlap

Depending on the write behavior, a chunk may contain multiple slices. If a file is written in a single, continuous operation, each chunk contains only one slice. However, if the file is written in multiple append operations, with each append triggering a flush, multiple slices are created.

JuiceFS chunks consist of slices, each slice corresponding to a continuous write operation.

For example, consider chunk1:

  • The user writes about 30 MB of data, creating slice5.
  • Next, starting at about the 20 MB mark, the user writes 45 MB (overwriting part of the existing data and appending new data).
    • This creates slice6 within chunk1.
    • As the data exceeds chunk1’s boundaries, slice7 and chunk2 are created, because slices cannot cross chunk boundaries.
  • Later, the user begins overwriting from the 10 MB mark of chunk1, creating slice8.
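The rule that a slice cannot cross a chunk boundary can be sketched as follows. This is an illustration under assumptions, not JuiceFS code: the segment type and splitWrite function are hypothetical, and chunk indexes are zero-based here, so chunk1 and chunk2 in the example correspond to indexes 0 and 1.

```go
package main

import "fmt"

const chunkSize = 64 << 20 // 64 MB, as described above

// segment is a hypothetical record for the part of a write that lands in one chunk.
type segment struct {
	ChunkIndex uint64 // which chunk the segment belongs to
	Pos        uint64 // start position inside that chunk
	Len        uint64 // number of bytes written into that chunk
}

// splitWrite cuts a write of n bytes at file offset off into per-chunk segments,
// so that no segment crosses a chunk boundary.
func splitWrite(off, n uint64) []segment {
	var segs []segment
	for n > 0 {
		idx, pos := off/chunkSize, off%chunkSize
		l := chunkSize - pos // room left in this chunk
		if l > n {
			l = n
		}
		segs = append(segs, segment{idx, pos, l})
		off += l
		n -= l
	}
	return segs
}

func main() {
	// A 45 MB write starting at the 20 MB mark of the first chunk.
	for _, s := range splitWrite(20<<20, 45<<20) {
		fmt.Printf("chunk %d: pos=%d MB len=%d MB\n", s.ChunkIndex, s.Pos>>20, s.Len>>20)
	}
}
```

The 45 MB write is cut into a 44 MB segment that fills the first chunk from the 20 MB mark and a 1 MB segment at the start of the next chunk, which corresponds to slice6 and slice7 in the example.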

Since slices overlap, several fields are introduced to indicate the valid data range:

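type slice struct {
    id    uint64 // slice ID
    size  uint32 // total size of the slice
    off   uint32 // offset of the valid data within the slice
    len   uint32 // length of the valid data
    pos   uint32 // position of the slice within its chunk
    left  *slice // This field is not stored in TiKV.
    right *slice // This field is not stored in TiKV.
}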

Handling multiple slices when reading chunk data: Fragmentation and merging

The figure below shows the relationship between slices and chunks in JuiceFS:

JuiceFS chunks consist of slices, each slice corresponding to a continuous write operation.

For JuiceFS users, there is only one file, but internally, the corresponding chunk may have multiple overlapping slices. In cases of overlap, the most recent write takes precedence.

Intuitively, in a chunk, when viewed from top to bottom, the parts that have been overwritten are considered invalid. Therefore, when reading a file, the system must search for the latest slice written within the current read range. When many slices overlap, this can significantly impact read performance, a phenomenon known as "file fragmentation."
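The "latest write wins" rule can be illustrated with a deliberately naive Go sketch that replays slices in write order and keeps a sorted list of visible ranges. It is not how JuiceFS implements the lookup (the left and right pointers in the slice struct suggest a tree), and the written and visible types are hypothetical. The 5 MB length of slice8 is an assumption for illustration, since the example does not state it.

```go
package main

import (
	"fmt"
	"sort"
)

const mb = 1 << 20

// written describes one slice recorded for a chunk (hypothetical, simplified).
type written struct {
	ID  uint64 // slice ID
	Pos uint32 // start position inside the chunk
	Len uint32 // number of bytes written
}

// visible is a range of the chunk that is served by exactly one slice.
type visible struct {
	ID  uint64 // slice that holds the valid data
	Pos uint32 // start position inside the chunk
	Off uint32 // offset of the valid data inside the slice
	Len uint32 // length of the valid data
}

// buildView applies slices in write order; each newer slice hides whatever it overlaps.
func buildView(slices []written) []visible {
	var view []visible
	for _, s := range slices {
		if s.Len == 0 {
			continue
		}
		sEnd := s.Pos + s.Len
		next := []visible{{ID: s.ID, Pos: s.Pos, Off: 0, Len: s.Len}}
		for _, v := range view {
			vEnd := v.Pos + v.Len
			if v.Pos < s.Pos { // the part of v to the left of s survives
				end := vEnd
				if end > s.Pos {
					end = s.Pos
				}
				next = append(next, visible{v.ID, v.Pos, v.Off, end - v.Pos})
			}
			if vEnd > sEnd { // the part of v to the right of s survives
				start := v.Pos
				if start < sEnd {
					start = sEnd
				}
				next = append(next, visible{v.ID, start, v.Off + (start - v.Pos), vEnd - start})
			}
		}
		sort.Slice(next, func(i, j int) bool { return next[i].Pos < next[j].Pos })
		view = next
	}
	return view
}

func main() {
	view := buildView([]written{
		{ID: 5, Pos: 0, Len: 30 * mb},
		{ID: 6, Pos: 20 * mb, Len: 44 * mb},
		{ID: 8, Pos: 10 * mb, Len: 5 * mb}, // length assumed for illustration
	})
	for _, v := range view {
		fmt.Printf("chunk[%2d MB, %2d MB) <- slice%d at offset %d MB\n",
			v.Pos/mb, (v.Pos+v.Len)/mb, v.ID, v.Off/mb)
	}
}
```

Under these assumptions the chunk ends up with four visible ranges: slice5 for 0 to 10 MB, slice8 for 10 to 15 MB, slice5 again for 15 to 20 MB (at offset 15 MB inside the slice), and slice6 for 20 to 64 MB. The more ranges a read has to stitch together, the more fragmented the file is.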

Fragmentation not only affects read performance but also increases space consumption at the object storage and metadata levels. Whenever a write occurs, the client checks for fragmentation and asynchronously performs fragmentation merging, combining all slices within a chunk.

Object storage: No slice entity

Similar to chunks, slices in object storage do not correspond to actual files.

  |-chunks/
  |   |-0/                # <-- id1 = slice_id / 1000 / 1000
  |   |  |-0/             # <-- id2 = slice_id / 1000
  |   |     |-1_0_16      # <-- {slice_id}_{block_id}_{size_of_this_block}
  |   |     |-3_0_4194304 #
  |   |     |-3_1_1048576 #
  |   |     |-...
  |-juicefs_uuid    
  |-meta/

Slice divided into fixed-size blocks: Concurrent reads and writes in object storage

To accelerate writing to object storage, JuiceFS divides slices into individual blocks (4 MB by default), allowing multi-threaded concurrent writes.

JuiceFS slices consist of blocks. Each block is an object in object storage.

Blocks are the final level in JuiceFS’ data segmentation design and the only level among chunks, slices, and blocks that corresponds to actual files in the bucket.

Blocks in a JuiceFS bucket

The differences between continuous writes and append writes:

  • Continuous writes: The default block size is 4 MB, and the last block is whatever remains.
  • Append writes: If the data is less than 4 MB, the final block stored in object storage is also smaller than 4 MB.

From the example above, the file names and sizes correspond to the following:

  • 1_0_16: Corresponds to file1_1KB. Even though file1_1KB only contains two lines of content, it’s represented in MinIO as two separate objects, each containing one line.
  • 3_0_4194304 + 3_1_1048576: A total of 5 MB, corresponding to file2_5MB.
  • 4_*: Corresponds to file3_129MB.
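The block sizes in these object names follow directly from cutting a slice into 4 MB pieces. A minimal sketch, assuming the default block size and using a hypothetical helper name blockSizes:

```go
package main

import "fmt"

const blockSize = 4 << 20 // default block size of 4 MB

// blockSizes returns the size of every block that a slice of sliceLen bytes is cut into.
func blockSizes(sliceLen int) []int {
	var sizes []int
	for sliceLen > 0 {
		n := blockSize
		if sliceLen < n {
			n = sliceLen // the last block is whatever remains
		}
		sizes = append(sizes, n)
		sliceLen -= n
	}
	return sizes
}

func main() {
	fmt.Println(blockSizes(5 << 20)) // the 5 MB slice of file2_5MB
	fmt.Println(blockSizes(16))      // a 16-byte slice
}
```

For the 5 MB slice this yields 4194304 and 1048576 bytes, matching the objects 3_0_4194304 and 3_1_1048576, while a 16-byte slice produces a single 16-byte block.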

Object key naming format (and code)

The naming format is:

{volume}/chunks/{id1}/{id2}/{slice_id}_{block_id}_{size_of_this_block}

Corresponding code:

func (s *rSlice) key(blockID int) string {
    if s.store.conf.HashPrefix { // false by default
        return fmt.Sprintf("chunks/%02X/%v/%v_%v_%v", s.id%256, s.id/1000/1000, s.id, blockID, s.blockSize(blockID))
    }
    return fmt.Sprintf("chunks/%v/%v/%v_%v_%v", s.id/1000/1000, s.id/1000, s.id, blockID, s.blockSize(blockID))
}

Mapping chunks, slices, and blocks to object storage

Finally, we map the data segmentation and organization of the volume to paths and objects in MinIO:

Mapping chunks, slices, and blocks to object storage

Summary: Object storage data and slice/block information alone cannot restore files

At this point, JuiceFS has solved the problem of how data is segmented and stored. This is a forward process: the user creates a file, which is segmented, named, and uploaded to object storage. The corresponding reverse process is: given the objects in object storage, how can we reconstruct them into the user's file? Clearly, just using the slice/block ID information from the object names is insufficient.

For example, in the simplest case, if there is no slice overlap within a chunk, we can reconstruct the file by piecing together slice_id/block_id/block_size information from the object names. However, we still wouldn't know the file’s name, path (parent directory), or permissions (rwx), among other details.

Once chunks have overlapping slices, it becomes impossible to restore the file solely from the information in object storage.

In addition, soft links, hard links, and file attributes cannot be restored from object storage alone.

To solve this reverse process, we need the file’s metadata as auxiliary information—this information was recorded in JuiceFS’ metadata engine before the file was segmented and written to object storage.

JuiceFS metadata design (TKV version)

JuiceFS supports different types of metadata engines, such as Redis, PostgreSQL, TiKV, and etcd. Each type has its own key naming rules. This section discusses the key naming rules when using the transactional key-value (TKV) type of metadata engine, specifically using TiKV.

TKV type key list

Here is a list of JuiceFS’ defined metadata keys, which store key-value pairs in the metadata engine. Note that these keys are different from those used for object storage:

  setting                           format
  C{name}                           counter
  A{8byte-inode}I                   inode attribute
  A{8byte-inode}D{name}             dentry
  A{8byte-inode}P{8byte-inode}      parents // for hard links
  A{8byte-inode}C{4byte-blockID}    file chunks
  A{8byte-inode}S                   symlink target
  A{8byte-inode}X{name}             extended attribute
  D{8byte-inode}{8byte-length}      deleted inodes
  F{8byte-inode}                    Flocks
  P{8byte-inode}                    POSIX locks
  K{8byte-sliceID}{8byte-blockID}   slice refs
  Ltttttttt{8byte-sliceID}          delayed slices
  SE{8byte-sessionID}               session expire time
  SH{8byte-sessionID}               session heartbeat // for legacy client
  SI{8byte-sessionID}               session info
  SS{8byte-sessionID}{8byte-inode}  sustained inode
  U{8byte-inode}                    usage of data length, space and inodes in directory
  N{8byte-inode}                    detached inode
  QD{8byte-inode}                   directory quota
  R{4byte-aclID}                    POSIX acl

In TKV keys, all integers are stored in binary form:

  • Inode and counter values occupy 8 bytes using little-endian encoding.
  • SessionID, sliceID, and timestamp also occupy 8 bytes but use big-endian encoding.

The setting key is a special case: its value contains the configuration information for the volume.
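The byte-order rules can be tried out with Go's encoding/binary package. The sketch below builds three keys from the list above; the function names are hypothetical, and treating the 4-byte chunk index as big-endian is an assumption inferred from the decoded keys shown later in this article rather than something stated by the rules.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// inodeKey builds A{8byte-inode}I with the inode in little-endian order.
func inodeKey(inode uint64) []byte {
	key := make([]byte, 10)
	key[0] = 'A'
	binary.LittleEndian.PutUint64(key[1:9], inode)
	key[9] = 'I'
	return key
}

// chunkKey builds A{8byte-inode}C{4byte-index}.
// Assumption: the chunk index is big-endian, as the decoded keys suggest.
func chunkKey(inode uint64, index uint32) []byte {
	key := make([]byte, 14)
	key[0] = 'A'
	binary.LittleEndian.PutUint64(key[1:9], inode)
	key[9] = 'C'
	binary.BigEndian.PutUint32(key[10:14], index)
	return key
}

// sessionInfoKey builds SI{8byte-sessionID} with the session ID in big-endian order.
func sessionInfoKey(sid uint64) []byte {
	key := make([]byte, 10)
	copy(key, "SI")
	binary.BigEndian.PutUint64(key[2:], sid)
	return key
}

func main() {
	fmt.Printf("%q\n", inodeKey(1))
	fmt.Printf("%q\n", chunkKey(4, 1))
	fmt.Printf("%q\n", sessionInfoKey(1))
}
```

In the inode key the value 1 appears as the first of the eight bytes (little-endian), whereas in the session key it appears as the last (big-endian).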

Each key type can be quickly distinguished by its prefix:

  • C: Counter, which includes several subtypes such as:
    • nextChunk
    • nextInode
    • nextSession
  • A: Inode-related keys (attributes, dentries, parents, file chunks, symlink targets, and extended attributes)
  • D: Deleted inodes
  • F: Flocks
  • P: POSIX locks
  • S: Session related
  • K: Slice references
  • L: Delayed (to-be-deleted) slices
  • U: Usage of data, space, and inodes within a directory
  • N: Detached inodes
  • QD: Directory quotas
  • R: POSIX ACL

It’s important to note that these key formats are defined by JuiceFS. When the keys and values are written to the metadata engine, the engine may further encode the keys. For example, TiKV might insert additional characters into the keys.

Key/value pairs in the metadata engine

Scanning related TiKV keys

The scan operation in TiKV is similar to the list prefix function in etcd. Here, we scan all keys related to the foo-dev volume:

key: zfoo-dev\375\377A\000\000\000\020\377\377\377\377\177I\000\000\000\000\000\000\371
key: zfoo-dev\375\377A\001\000\000\000\000\000\000\377\000Dfile1_\3771KB\000\000\000\000\000\372
key: zfoo-dev\375\377A\001\000\000\000\000\000\000\377\000Dfile2_\3775MB\000\000\000\000\000\372
...
key: zfoo-dev\375\377SI\000\000\000\000\000\000\377\000\001\000\000\000\000\000\000\371
        default cf value: start_ts: 453485726123950084 value: 7B225665727369...33537387D
key: zfoo-dev\375\377U\001\000\000\000\000\000\000\377\000\000\000\000\000\000\000\000\370
key: zfoo-dev\375\377setting\000\376
        default cf value: start_ts: 453485722598113282 value: 7B0A224E616D65223A202266...0A7D
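The extra \377 bytes and the trailing zero padding in these keys look consistent with a memcomparable byte encoding, in which every 8 data bytes are followed by a marker byte equal to 0xFF minus the number of padding bytes in that group. The Go sketch below reverses that grouping. Treat the encoding rule as an assumption based on the scan output above rather than as a statement about TiKV internals; the function name decodeMemcomparable is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// decodeMemcomparable reverses the grouped byte encoding seen in the scan output:
// each group is 8 data bytes plus one marker byte, where marker = 0xFF - padding.
// A marker below 0xFF means the group is padded and is the last one.
func decodeMemcomparable(enc []byte) ([]byte, error) {
	var out []byte
	for len(enc) >= 9 {
		group, marker := enc[:8], enc[8]
		pad := int(0xFF - marker)
		if pad > 8 {
			return nil, errors.New("invalid marker byte")
		}
		out = append(out, group[:8-pad]...)
		enc = enc[9:]
		if pad > 0 {
			return out, nil // a padded group ends the key
		}
	}
	return nil, errors.New("truncated key")
}

func main() {
	// Key copied from the scan output, without the leading "z".
	raw := "foo-dev\375\377A\001\000\000\000\000\000\000\377\000Dfile1_\3771KB\000\000\000\000\000\372"
	key, err := decodeMemcomparable([]byte(raw))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", key)
}
```

Applied to the second key in the listing, this strips the three \377 markers and the five padding bytes, leaving the volume prefix followed by the JuiceFS dentry key for file1_1KB.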

Decoding JuiceFS metadata keys

You can use the tikv-ctl --decode <key> command to decode the keys. After removing the z prefix, the original JuiceFS keys become clearer:

foo-dev\375A\001\000\000\000\000\000\000\000Dfile1_1KB
foo-dev\375A\001\000\000\000\000\000\000\000Dfile2_5MB
foo-dev\375A\001\000\000\000\000\000\000\000Dfile3_129MB
foo-dev\375A\001\000\000\000\000\000\000\000I
foo-dev\375A\002\000\000\000\000\000\000\000C\000\000\000\000
foo-dev\375A\002\000\000\000\000\000\000\000I
foo-dev\375A\003\000\000\000\000\000\000\000C\000\000\000\000
foo-dev\375A\003\000\000\000\000\000\000\000I
foo-dev\375A\004\000\000\000\000\000\000\000C\000\000\000\000
foo-dev\375A\004\000\000\000\000\000\000\000C\000\000\000\001
foo-dev\375A\004\000\000\000\000\000\000\000C\000\000\000\002
foo-dev\375A\004\000\000\000\000\000\000\000I
foo-dev\375ClastCleanupFiles
foo-dev\375ClastCleanupSessions
foo-dev\375ClastCleanupTrash
foo-dev\375CnextChunk
foo-dev\375CnextCleanupSlices
foo-dev\375CnextInode
foo-dev\375CnextSession
foo-dev\375CtotalInodes
foo-dev\375CusedSpace
foo-dev\375SE\000\000\000\000\000\000\000\001
foo-dev\375SI\000\000\000\000\000\000\000\001
foo-dev\375U\001\000\000\000\000\000\000\000
foo-dev\375setting

These decoded keys reveal the metadata of the three created files. Because the metadata records identifiers such as slice_id, each file can be associated with its data blocks in object storage.

Further decoding based on the key encoding rules can provide more specific details, such as sliceID and inode.
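As an example of such further decoding, the following Go sketch interprets keys of the A{8byte-inode}... family from the listing above. The function name describeKey is hypothetical, the volume prefix foo-dev\375 is taken from the listing, and the big-endian chunk index is an assumption read off the decoded keys.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// describeKey interprets a decoded key of the A{8byte-inode}... family.
// prefix is the volume prefix that precedes the JuiceFS key.
func describeKey(key, prefix []byte) string {
	if !bytes.HasPrefix(key, prefix) {
		return "unknown prefix"
	}
	k := key[len(prefix):]
	if len(k) < 10 || k[0] != 'A' {
		return "not an inode key"
	}
	inode := binary.LittleEndian.Uint64(k[1:9]) // inodes are little-endian
	rest := k[10:]
	switch k[9] {
	case 'I':
		return fmt.Sprintf("inode %d: attribute", inode)
	case 'D':
		return fmt.Sprintf("inode %d: dentry %q", inode, rest)
	case 'C':
		if len(rest) != 4 {
			return "malformed chunk key"
		}
		// Assumption: the chunk index is big-endian.
		return fmt.Sprintf("inode %d: chunk index %d", inode, binary.BigEndian.Uint32(rest))
	}
	return fmt.Sprintf("inode %d: key type %q", inode, k[9])
}

func main() {
	prefix := []byte("foo-dev\375")
	keys := []string{
		"foo-dev\375A\001\000\000\000\000\000\000\000Dfile2_5MB",
		"foo-dev\375A\004\000\000\000\000\000\000\000C\000\000\000\002",
		"foo-dev\375A\003\000\000\000\000\000\000\000I",
	}
	for _, k := range keys {
		fmt.Println(describeKey([]byte(k), prefix))
	}
}
```

Read this way, the first key is a dentry named file2_5MB under directory inode 1, the second is the chunk with index 2 of inode 4, and the third is the attribute record of inode 3.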

Common challenges with this design

Recovering files from data and metadata

Theoretical steps

For a given JuiceFS file, the previous sections explained two forward processes:

  1. The file is split into chunks, slices, and blocks. Then they’re written to object storage.
  2. The file's metadata, including inode, slice, and block information, is organized and written to the metadata engine.

With this understanding of the forward process, the reverse process of recovering a file from object storage and the metadata engine becomes clear:

  1. Scan the metadata engine to gather details such as file name, inode, slice, size, location, and permissions.
  2. Reconstruct the object keys in object storage using slice_id, block_id, and block_size.
  3. Sequentially retrieve the data from object storage using these keys, assemble the file, and write it locally with the appropriate permissions.
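Steps 2 and 3 can be sketched as a read plan: given a slice ID, the slice size, and the byte range to read (all obtained from metadata), work out which objects to fetch and which bytes to take from each. This is a minimal sketch assuming the default 4 MB block size and the non-hashed key layout; the blockRead type and planReads function are hypothetical, and actually fetching the objects is left out.

```go
package main

import "fmt"

const blockSize = 4 << 20 // default block size (4 MB)

// blockRead is a hypothetical description of one ranged read against object storage.
type blockRead struct {
	Key    string // object key in the non-hashed layout
	Offset int    // first byte to read inside the object
	Length int    // number of bytes to read
}

// planReads maps the byte range [off, off+n) of a slice onto the block objects
// that store it. sliceLen is the total slice size, needed to name the last block.
func planReads(volume string, sliceID uint64, sliceLen, off, n int) []blockRead {
	var reads []blockRead
	for n > 0 && off < sliceLen {
		idx := off / blockSize
		inBlock := off % blockSize
		size := blockSize // size of this block as it appears in the key
		if rest := sliceLen - idx*blockSize; rest < size {
			size = rest
		}
		l := size - inBlock
		if l > n {
			l = n
		}
		key := fmt.Sprintf("%s/chunks/%d/%d/%d_%d_%d",
			volume, sliceID/1000/1000, sliceID/1000, sliceID, idx, size)
		reads = append(reads, blockRead{key, inBlock, l})
		off += l
		n -= l
	}
	return reads
}

func main() {
	// Read all of a 5242880-byte slice with ID 3 in volume foo-dev.
	for _, r := range planReads("foo-dev", 3, 5242880, 0, 5242880) {
		fmt.Printf("%s offset=%d length=%d\n", r.Key, r.Offset, r.Length)
	}
}
```

For a 5 MB slice with ID 3, the plan consists of two full-object reads, foo-dev/chunks/0/0/3_0_4194304 and foo-dev/chunks/0/0/3_1_1048576. When slices overlap, the same planning would be applied to each valid range (off and len) of each slice.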

Using juicefs info to view file chunk, slice, and block information

JuiceFS provides the juicefs info command to view a file's chunk, slice, and block details directly. For example:

foo-dev/file2_5MB :
  inode: 3
  files: 1
   dirs: 0
 length: 5.00 MiB (5242880 Bytes)
   size: 5.00 MiB (5242880 Bytes)
   path: /file2_5MB
 objects:
+------------+--------------------------------+---------+--------+---------+
| chunkIndex |           objectName           |   size  | offset |  length |
+------------+--------------------------------+---------+--------+---------+
|          0 | foo-dev/chunks/0/0/3_0_4194304 | 4194304 |      0 | 4194304 |
|          0 | foo-dev/chunks/0/0/3_1_1048576 | 1048576 |      0 | 1048576 |
+------------+--------------------------------+---------+--------+---------+

This output matches what we observed in MinIO.

Why can’t files written to object storage by JuiceFS be read directly?

The term “cannot be read” here refers to the inability to directly reconstruct the original file from the object storage objects, not the inability to access the objects themselves.

As explained in this article, JuiceFS splits files into chunks, slices, and blocks before storing them in object storage. The data is stored without duplicate content, and no file metadata (such as file names) is included.

Therefore, given only what is stored in object storage, the individual objects can be read, but the original files cannot be reconstructed and presented to the user.

If you have any questions about this article, feel free to join JuiceFS discussions on GitHub and community on Slack.
