File Import and Conversion
By default, JuiceFS stores files in blocks and separates metadata from data. This storage format and separation architecture enable JuiceFS to be a high-performance and strongly consistent file system.
However, in some rare scenarios, users prefer to store original files directly in object storage, so that the files can be used without JuiceFS metadata. Alternatively, they may want to directly import a large number of existing files from object storage into JuiceFS, enabling them to be accessed via POSIX and benefit from JuiceFS' powerful caching capabilities. Storing complete files in object storage and using them in JuiceFS is referred to as the "compatible format," distinguishing it from the default "optimized format."
Starting from version 5.0, JuiceFS has significantly improved support for the compatible format, providing the following features to meet the aforementioned requirements:
- The import feature for object storage, also known as the juicefs import command. This command has been available for some time, but starting from version 5.0, imported files support read caching as well.
- The convert feature, which reassembles optimized-format blocks in JuiceFS back to original files and uploads them to object storage. This allows you to directly access the original files in object storage, with caching support.
Import existing object storage files
juicefs import scans the given object storage address and writes the metadata of the target files into JuiceFS' metadata engine, allowing these files to be accessed in JuiceFS. This operation does not actually copy any files; the files remain as they are in object storage. Therefore, this storage format is called the compatible format, meaning it is compatible with object storage.
When you use imported files, please note:
- Imported files also occupy file system space, contribute to directory quotas, and are included in billing.
- You can modify file names and permissions, but you cannot modify the object storage data. In other words, no matter what operation you perform, the original objects in object storage will remain unchanged.
- Deleting these files will only delete their metadata and will not actually delete the source files in object storage.
- The imported files' metadata in JuiceFS does not support the trash feature. If you delete imported files in JuiceFS, you will not find them in the trash. If you need to recover them, you can only re-import them.
- Files imported into JuiceFS cannot be easily distinguished from regular files. If you need to check, use the juicefs info command and look for the object field (rather than a chunks table) to determine whether a file is stored in compatible format.
Cache for import
Starting from JuiceFS 5.0, imported files also support local cache and distributed cache. Although imported files are not actually written to the JuiceFS file system and do not go through JuiceFS' sharded formatting process, when cached to the local disk, they are still split into data blocks (the size is the file system's block size). Therefore, the usage and management of cache for imported files are no different from normal files written to the JuiceFS file system.
Caching isn't supported for external buckets. This means that to have caching support, the JuiceFS volume and the import source must be the same bucket, and the import command must be run in the following format:
# URI doesn't include bucket name, caching is supported
juicefs import / /jfs/imported
juicefs import /prefix /jfs
# If URI contains bucket, caching is no longer supported
# In the below example, even if BUCKET is the same as the JuiceFS volume bucket, caching will not be available
juicefs import BUCKET/prefix /jfs
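The rule in the comments above can be sketched as a small pre-flight check. This is only an illustration under the assumption that "no bucket name" means the source starts with a slash, as in the examples; the function name import_source_cacheable is hypothetical and is not part of the juicefs CLI.

```shell
#!/bin/sh
# Hypothetical helper (not a juicefs command): succeeds when the import source
# is written without a bucket name, which is the only form described above as
# supporting caching. Assumption: such a source always starts with "/".
import_source_cacheable() {
    case $1 in
        /*) return 0 ;;   # e.g. "/" or "/prefix": same bucket as the volume
        *)  return 1 ;;   # e.g. "BUCKET/prefix": treated as an external bucket
    esac
}

# Usage:
#   import_source_cacheable /prefix && echo "imported files can be cached"
```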
When you use JuiceFS' cache feature to speed up reads of imported files, be aware of consistency: the imported objects themselves are not managed by JuiceFS, so if an object is modified without being re-imported, stale cache may remain and reads are not guaranteed to return the latest data. Therefore, objects that change after being imported need to be re-imported. Once they are re-imported, existing cache data is automatically invalidated based on the modification time of the imported objects, which ensures that you read the modified data.
For objects that need to be modified repeatedly, it is recommended to migrate the data to JuiceFS as a whole, using juicefs sync to write data to JuiceFS. Because of JuiceFS' POSIX compatibility, you can use any other tool as well.
Import on demand (experimental feature)
Starting from 5.1.3, JuiceFS supports import-on-demand as an experimental feature. Built on top of the import feature described above, import-on-demand maps the entire object storage bucket as a JuiceFS file system. The difference is that it only scans object storage and builds metadata when a directory is accessed. This feature offers the following benefits compared to a simple periodic import:
- It significantly relieves list pressure on object storage, because listing only happens when a directory is accessed;
- When an object storage bucket is too large, doing a full import can put stress on the node’s memory. You can use on-demand import to access the whole bucket while minimizing memory overhead.
This is still an experimental feature, which means its usage, even design, is subject to change in the future. If you'd like to evaluate, do contact a Juicedata engineer and discuss the process with us.
Synopsis
Import-on-demand can only access objects inside the linked bucket, so you must associate the source bucket with the file system from the very beginning (via the JuiceFS Web Console). Only then can you mount the file system on a server.
In the following command, --source=/ maps the object storage bucket root onto the file system root. Currently this argument is fixed, and you cannot pass values other than /.
juicefs mount myjfs /jfs --source=/
After a successful mount, run ls to list the top-level directories:
$ cd /jfs
$ ls -alh
...
drwxrwxrwx 2 root root 4.0K Nov 11 17:35 dir1
drwxrwxrwx 2 root root 4.0K Nov 11 17:35 dir2
drwxrwxrwx 2 root root 4.0K Nov 11 17:35 dir3
Note the 777 permission above. In an import-on-demand file system, permissions carry a special meaning: a 777 directory has not been accessed yet, so there is no metadata underneath it. Once the directory is accessed by any means, the JuiceFS Client scans the object storage and quickly generates the corresponding directory structure. After that, the directory permission changes:
# Access dir1 by any means (cd, ls, or just read directly)
$ ls dir1/file.txt
# After the directory has been accessed, permission changes from 777 to 555
$ ls -alh
...
dr-xr-xr-x 5 root root 16K Nov 11 17:47 dir1
drwxrwxrwx 2 root root 4.0K Nov 11 17:35 dir2
drwxrwxrwx 2 root root 4.0K Nov 11 17:35 dir3
Hence, in an import-on-demand workflow, the metadata load state is indicated by the directory permission: 777 means metadata is not yet loaded, while any other value means metadata is already loaded and is available for reads or cache warmup. It's important that you do not tamper with directory permissions.
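As a rough illustration of the permission rule, the following sketch reports the metadata load state of a directory by reading its permission bits. The helper name metadata_state and the /jfs mount point in the usage comment are assumptions for illustration, and stat -c is the GNU coreutils form. The helper only reads permissions and never changes them.

```shell
#!/bin/sh
# Hypothetical helper (not part of JuiceFS): prints "not-loaded" for a 777
# directory and "loaded" for any other mode, following the rule described
# in the text.
metadata_state() {
    mode=$(stat -c '%a' "$1") || return 1   # GNU stat: octal permission bits
    if [ "$mode" = "777" ]; then
        echo "not-loaded"
    else
        echo "loaded"
    fi
}

# Usage (assumes the file system is mounted at /jfs):
#   for d in /jfs/*/; do printf '%s\t%s\n' "$d" "$(metadata_state "$d")"; done
```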
Write support (not recommended)
As with the import feature above, you cannot modify an object storage file via JuiceFS. Even if you delete such files from your JuiceFS volume, only the metadata index is deleted, which has no effect on the object storage bucket.
That said, an import-on-demand file system can still accept new files just like any other JuiceFS file system. However, the files created are not uploaded to object storage as is; they are split and written in the standard JuiceFS sharded format, under the /[volume-name]/chunks/ prefix in the object storage bucket. So if you do create new files in an import-on-demand file system, there will be two types of files inside the bucket: the original bucket objects, and the split files written via JuiceFS.
Due to these management difficulties, we do not recommend writing to an import-on-demand file system. It can be confusing to distinguish between imported objects and files written natively. Worse, when metadata expires, the next scan rebuilds the metadata and simply erases any extra files, which means all written files will be deleted.
Metadata expiration
After an import, objects may still be modified. To see the latest changes on the object storage side, you must set an appropriate refresh interval:
juicefs mount myjfs /jfs --source=/ --refresh-interval=1h
The above setting expires the imported metadata after one hour. After that, any access (like ls, or lookup on a non-existent file) triggers a new scan to rebuild the metadata. When multiple clients try to initiate a scan at the same time, a file lock ensures that only one client generates metadata for the affected directory, which also alleviates list pressure on the object storage. Depending on the actual scale and list performance, the scan can still block read requests for a while; after the metadata is rebuilt, subsequent accesses are fast again. Set this refresh interval based on how frequently the object storage file list changes, as well as the consistency requirements of the application.
If the application is very sensitive to latency and you want to avoid the lag caused by rebuilding metadata, you can enable asynchronous refresh so that the scan doesn't block access:
juicefs mount myjfs /jfs --source=/ --refresh-interval=1h --refresh-in-background
Note that if --refresh-in-background is enabled, the consistency guarantee defined by --refresh-interval no longer applies. Taking the above one-hour interval as an example: with background refresh enabled, once the metadata has expired after an hour, running ls returns the current metadata immediately, even if the scan has not completed.
Cache and warmup
Running the juicefs warmup command builds cache only for imported data, i.e. directories with 555 permission. If juicefs warmup is run against a 777 directory, nothing happens because metadata isn't imported yet. This design ensures that cache is also built on demand, so users must access the desired data in advance to build metadata for the directories.
To demonstrate:
$ ls -alh
...
dr-xr-xr-x 5 root root 16K Nov 11 17:47 dir1
drwxrwxrwx 2 root root 4.0K Nov 11 17:35 dir2
drwxrwxrwx 2 root root 4.0K Nov 11 17:35 dir3
# Warming up a 777 directory does nothing
$ juicefs warmup dir3
# Only accessed directories can be warmed up
$ juicefs warmup dir1
Similar to the import command, files imported on demand support local caching and distributed caching as well; read the relevant section to learn more.
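The warmup rule for import-on-demand can be turned into a small wrapper that skips directories whose metadata has not been loaded. This is a minimal sketch, not an official tool: the function name warmup_loaded_dirs and the WARMUP_CMD variable are hypothetical (the latter exists only so that the command can be swapped out for a dry run), and stat -c assumes GNU coreutils.

```shell
#!/bin/sh
# Hypothetical wrapper: warm up only the top-level directories whose metadata
# is already loaded (any mode other than 777), and report the ones skipped.
# WARMUP_CMD is an illustrative override; it defaults to `juicefs warmup`.
WARMUP_CMD=${WARMUP_CMD:-juicefs warmup}

warmup_loaded_dirs() {
    root=$1
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue        # the glob matched nothing
        dir=${dir%/}
        mode=$(stat -c '%a' "$dir")      # GNU stat: octal permission bits
        if [ "$mode" = "777" ]; then
            echo "skip (metadata not loaded): $dir"
        else
            echo "warmup: $dir"
            $WARMUP_CMD "$dir"           # intentionally unquoted: command plus arguments
        fi
    done
}

# Usage (assumes the file system is mounted at /jfs):
#   warmup_loaded_dirs /jfs
```

Directories reported as skipped need to be accessed first (for example with ls) before a later warmup run can build cache for them.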
Convert
When the convert feature is enabled, files are converted into complete files and stored in object storage after a specified period.
Typical use cases
- Files are initially written by JuiceFS, but later need to be accessed directly from object storage to integrate with cloud ecosystems. The emphasis is on direct access here, because JuiceFS itself does provide an S3 API through S3 Gateway. If you just need to provide an S3 API for your file system, use our S3 Gateway instead.
- Utilize the archiving capabilities of object storage to archive cold data. The archived data can be taken out and accessed without JuiceFS metadata. Still, the emphasis is on use without JuiceFS metadata, because JuiceFS natively supports separation of cold and hot data: simply use --storage-class to specify a storage class, which is much simpler.
- Compliance with data regulations that require files to be stored in their original intact format.
- Other scenarios that demand data be separated from JuiceFS metadata, so that it can be taken out and used without JuiceFS.
Forbidden use cases
Convert is an experimental feature designed for some very special occasions. If your use case isn't listed above, you should never use this feature, because it imposes some important limitations on the file system (for example, converted files are read-only). Continue to the section below to learn more.
These are some cases that can easily be mistaken for convert use cases, but should not use the convert feature:
- You need an S3 endpoint to access your JuiceFS file system. Our S3 Gateway is specifically built for this type of use, and shouldn't involve the convert feature at all.
- Separate hot/cold data. JuiceFS Client can specify a storage class (via --storage-class) during juicefs auth, so that different clients can handle files destined for different storage classes.
Synopsis
The effect of conversion on the object storage file list is as follows:
# Before conversion
mybucket/
├── chunks
│   ├── 41
│   │   └── 1
│   │       ├── 1000001_0_4194304
│   │       └── 1000001_10_4194304
│   ├── 43
│   │   └── 1
│   │       ├── 1000003_0_4194304
...
# After conversion, files are written into object storage as is, preserving the directory structure. The original sharded-format data blocks are deleted.
mybucket/
├── bigfile1.tar.gz
├── chunks
├── dir/bigfile2.tar.gz
Because the purpose of conversion is to decouple from the JuiceFS sharded format and store data as is in object storage, files are stored according to the directory structure in the file system. Therefore, if the convert feature is enabled, the file system must exclusively occupy the object storage bucket. To avoid conflicts and potential data loss, it should not be used for multiple purposes or other JuiceFS file systems.
After conversion, files no longer support content modifications, and write operations will result in permission errors. While they cannot be edited, they can be moved using the mv command. In JuiceFS, this command is interpreted as a "cross-device copy + delete." It reads the file normally from the compatible format, writes it back to JuiceFS in sharded format as a new file, and deletes the original file. As the mv command converts the file from the compatible format back to the sharded format, the file can be edited again until the specified time has passed, at which point it can be converted again.
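The mv behavior described above suggests a simple way to make a converted file editable again: move it to a new path. The following is a minimal sketch under that assumption; the helper name restore_editable and the example paths are hypothetical, and the guard against overwriting an existing destination is an extra safety choice, not something JuiceFS requires.

```shell
#!/bin/sh
# Hypothetical helper: move a converted (read-only) file to a new path.
# Per the description above, JuiceFS handles this as copy + delete, so the
# destination is written as a new file in sharded format.
restore_editable() {
    src=$1
    dst=$2
    if [ ! -f "$src" ]; then
        echo "not a regular file: $src" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "refusing to overwrite: $dst" >&2
        return 1
    fi
    mv -- "$src" "$dst"
}

# Usage (paths are examples only):
#   restore_editable /jfs/bigfile1.tar.gz /jfs/bigfile1.editable.tar.gz
```

Because the file content is read and rewritten, expect the move to take time proportional to the file size.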
For file systems with the convert feature enabled, directories (including empty directories) cannot be moved with mv once some time has passed since their creation. They can only be deleted and then recreated.
Converted files do not support the trash feature. Once deleted, they do not appear in the trash, and they cannot be recovered. The corresponding objects in object storage are also cleaned up asynchronously by client background tasks.
The convert feature is currently a beta feature. If you want to evaluate it, contact a Juicedata engineer and we'll enable this feature for your file system. Once enabled, you can navigate to the settings page and adjust relevant settings there.
Cache for conversion
Files in the compatible format support local cache and distributed cache. Even though converted files are no longer in sharded format, when cached to local storage, they are still split into data blocks (sized according to the file system's block size). Therefore, the usage and management of caching for converted files is no different from sharded-format files.
However, it is important to note that after conversion, existing cache is invalidated due to metadata changes, and the files need to be warmed up again to reestablish local cache.