Data Synchronization

juicefs sync is a powerful data migration tool that copies data between all supported storage systems, including object storage, JuiceFS itself, and local file systems. It also supports remote directories through SSH, HDFS, WebDAV, etc., and provides advanced features such as incremental synchronization, pattern matching (like rsync), and distributed syncing.

Basic Usage

Command Syntax

juicefs sync [command options] SRC DST

Synchronizes data from SRC to DST. Both directories and files are supported.

Arguments:

  • SRC is the source data address or path;
  • DST is the destination address or path;
  • [command options] are synchronization options. See command reference for more details.

Address format:

[NAME://][ACCESS_KEY:SECRET_KEY[:TOKEN]@]BUCKET[.ENDPOINT][/PREFIX]

# MinIO only supports path style
minio://[ACCESS_KEY:SECRET_KEY[:TOKEN]@]ENDPOINT/BUCKET[/PREFIX]

Explanation:

  • NAME is the storage type like s3 or oss. See available storage services for more details;
  • ACCESS_KEY and SECRET_KEY are the credentials for accessing object storage APIs. If they contain special characters, these must be escaped manually; for example, / must be replaced with its escape sequence %2F;
  • TOKEN is the token used to access the object storage, as some object storage services support temporary tokens that grant permission for a limited time;
  • BUCKET[.ENDPOINT] is the address of the object storage;
  • PREFIX is the common prefix of the directories to synchronize, optional.
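The escaping rule for credentials can be automated. The following is a minimal sketch, assuming ASCII keys and a POSIX shell; the urlencode helper, the sample secret key, and the bucket name mybucket are hypothetical and not part of juicefs.

```shell
#!/bin/sh
# urlencode STRING: percent-encode every character outside the RFC 3986
# unreserved set (letters, digits, '-', '.', '_', '~').
# Hypothetical helper, assumes ASCII input.
urlencode() (
  LC_ALL=C              # treat the input byte by byte
  s=$1
  while [ -n "$s" ]; do
    c=${s%"${s#?}"}     # first character of $s
    s=${s#?}            # remainder of $s
    case $c in
      [a-zA-Z0-9._~-]) printf '%s' "$c" ;;
      *) printf '%%%02X' "'$c" ;;   # e.g. / becomes %2F
    esac
  done
)

# Made-up credentials and bucket, for illustration only.
ACCESS_KEY='ABCDEFG'
SECRET_KEY='HIJ/KLM+N'
echo "s3://${ACCESS_KEY}:$(urlencode "$SECRET_KEY")@mybucket.s3.us-west-1.amazonaws.com"
```

The printed address can then be passed to juicefs sync as SRC or DST.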

Here is an example of the object storage address of Amazon S3.

s3://ABCDEFG:[email protected]

In particular, SRC and DST ending with a trailing / are treated as directories, e.g. movie/. Those that don't end with a trailing / are treated as prefixes and are used for pattern matching. For example, assuming we have test and text directories in the current directory, the following command synchronizes them into the destination ~/mnt/.

juicefs sync ./te ~/mnt/te

In this way, the subcommand sync takes te as a prefix to find all the matching directories, i.e. test and text. ~/mnt/te is also a prefix, and all directories and files synchronized to this destination will be renamed by replacing the original prefix te with the new prefix te. The changes in the names of directories and files before and after synchronization cannot be seen in the above example. However, if we take another prefix, for example, ab,

juicefs sync ./te ~/mnt/ab

the test directory synchronized to the destination directory will be renamed as abst, and text will be abxt.
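The renaming can be reproduced with plain string manipulation. This is only an illustration of the prefix rule described above, not how juicefs implements it; rename_by_prefix is a hypothetical helper.

```shell
#!/bin/sh
# rename_by_prefix NAME SRC_PREFIX DST_PREFIX
# Prints the destination name for NAME, or returns 1 when NAME does not
# start with SRC_PREFIX (such an entry would not be selected at all).
rename_by_prefix() {
  case $1 in
    "$2"*)
      rest=${1#"$2"}              # part of the name after the source prefix
      printf '%s\n' "$3$rest"
      ;;
    *) return 1 ;;
  esac
}

for name in test text other; do
  if new=$(rename_by_prefix "$name" te ab); then
    echo "$name -> $new"
  else
    echo "$name: not matched by prefix te"
  fi
done
```

With the prefixes te and ab, test maps to abst and text maps to abxt, while a name such as other is not selected.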

Required Storages

Assume that we have the following storages.

  1. Object Storage A

    • Bucket name: aaa
    • Endpoint: https://aaa.s3.us-west-1.amazonaws.com
  2. Object Storage B

    • Bucket name: bbb
    • Endpoint: https://bbb.oss-cn-hangzhou.aliyuncs.com
  3. JuiceFS File System

    • Metadata Storage: redis://10.10.0.8:6379/1
    • Object Storage: https://ccc-125000.cos.ap-beijing.myqcloud.com

All of the storages share the same credentials:

  • ACCESS_KEY: ABCDEFG
  • SECRET_KEY: HIJKLMN

Synchronize between Object Storage and JuiceFS

The following command synchronizes the movies directory from Object Storage A to JuiceFS File System.

# mount JuiceFS
juicefs mount -d redis://10.10.0.8:6379/1 /mnt/jfs
# synchronize
juicefs sync s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com/movies/ /mnt/jfs/movies/

The following command synchronizes the images directory from JuiceFS File System to Object Storage A.

# mount JuiceFS
juicefs mount -d redis://10.10.0.8:6379/1 /mnt/jfs
# synchronization
juicefs sync /mnt/jfs/images/ s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com/images/

Synchronize between Object Storages

The following command synchronizes all of the data on Object Storage A to Object Storage B.

juicefs sync s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com oss://ABCDEFG:HIJKLMN@bbb.oss-cn-hangzhou.aliyuncs.com

Synchronize between Local and Remote Servers

To copy files between directories on a local computer, simply specify the source and destination paths. For example, to synchronize the /media/ directory with the /backup/ directory:

juicefs sync /media/ /backup/

If you need to synchronize between servers, you can access the target server using the SFTP/SSH protocol. For example, to synchronize the local /media/ directory with the /backup/ directory on another server:

juicefs sync /media/ username@192.168.1.100:/backup/
# Specify password (optional)
juicefs sync /media/ "username:password"@192.168.1.100:/backup/

When using the SFTP/SSH protocol, if no password is specified, the sync task will prompt for the password. If you want to explicitly specify the username and password, you need to enclose them in double quotation marks, with a colon separating the username and password.

Sync Without Mount Point (added in v1.1)

For data migrations that involve JuiceFS, it's recommended to use the jfs:// protocol rather than mounting JuiceFS and accessing its local directory. This bypasses the FUSE mount point and accesses JuiceFS directly. In large-scale scenarios, bypassing FUSE can save precious resources and increase performance.

myfs=redis://10.10.0.8:6379/1 juicefs sync s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com/movies/ jfs://myfs/movies/

Advanced Usage

Observation

Simply put, when using sync to transfer big files, the progress bar might move slowly or get stuck. If this happens, you can observe the progress using other methods.

sync is designed mainly for copying a large number of files, and its progress bar reflects this: progress only updates when a file has been fully transferred. With large files, each file takes a long time to transfer, hence the slow or even static progress bar. This is worse for destinations without multipart upload support (e.g. file, sftp, jfs, gluster schemes), where every file is transferred in a single thread.

If the progress bar is not moving, use the methods below to observe and troubleshoot:

  • If either end is a JuiceFS mount point, you can use juicefs stats to quickly check current IO status.
  • If the destination is a local disk, look for temporary files that end with .tmp.xxx. These are created by sync and are renamed once the transfer completes. Watch for size changes in the temp files to verify the current IO status.
  • If both ends are object storage services, use tools like nethogs to check network IO.
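For the local-disk case, a small helper can list the in-flight temporary files together with their sizes. This is a sketch under the assumption that the temp files match the glob *.tmp.*; list_sync_tmp is a hypothetical name, and the directory defaults to the current one.

```shell
#!/bin/sh
# list_sync_tmp DIR: list temporary files under DIR (names containing
# ".tmp.") with their sizes. The glob is an assumption based on the
# ".tmp.xxx" naming described above.
list_sync_tmp() {
  find "$1" -type f -name '*.tmp.*' -exec ls -l {} +
}

# Run this repeatedly (for example every few seconds) and compare sizes;
# growing files indicate that data is still being written.
list_sync_tmp "${1:-.}"
```

If either end is a JuiceFS mount point such as /mnt/jfs, running juicefs stats /mnt/jfs in another terminal gives a quicker view of the same activity.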

Incremental and Full Synchronization

The subcommand sync works incrementally by default: it compares the source and target paths and synchronizes only the differences. You can add the option --update or -u to keep the mtime of the synchronized directories and files up to date.

For full synchronization, i.e. always synchronizing regardless of whether the destination files exist, add the option --force-update or -f. For example, the following command fully synchronizes the movies directory from Object Storage A to JuiceFS File System.

# mount JuiceFS
juicefs mount -d redis://10.10.0.8:6379/1 /mnt/jfs
# full synchronization
juicefs sync --force-update s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com/movies/ /mnt/jfs/movies/

Pattern Matching

The pattern matching function of the subcommand sync is similar to that of rsync. It allows you to exclude or include certain classes of files by rules, and to synchronize any set of files by combining multiple rules. The following rules are available:

  • Patterns ending with / match only directories; otherwise, they match files, links or devices;
  • Patterns containing *, ? or [ match as wildcards; otherwise, they match as regular strings;
  • * matches any non-empty path component (it stops at /);
  • ? matches any single character except /;
  • [ matches a set of characters, for example [a-z] or [[:alpha:]];
  • Backslashes can be used to escape characters in wildcard patterns, while they match literally when no wildcards are present;
  • Patterns are always matched recursively, using patterns as prefixes.

Exclude Directories/Files

Option --exclude can be used to exclude patterns. The following example shows a full synchronization from JuiceFS File System to Object Storage A, excluding hidden directories and files:

Remark

Linux treats a directory or file whose name starts with . as hidden.

# mount JuiceFS
juicefs mount -d redis://10.10.0.8:6379/1 /mnt/jfs
# full synchronization, excluding hidden directories and files
juicefs sync --exclude '.*' /mnt/jfs/ s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com/

You can use this option several times with different parameters in the command to exclude multiple patterns. For example, using the following command can exclude all hidden files, pic/ directory and 4.png file in synchronization:

juicefs sync --exclude '.*' --exclude 'pic/' --exclude '4.png' /mnt/jfs/ s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com

Include Directories/Files

Option --include can be used to include patterns you don't want to exclude. For example, only pic/ and 4.png are synchronized and all the others are excluded after executing the following command:

juicefs sync --include 'pic/' --include '4.png' --exclude '*' /mnt/jfs/ s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com

NOTICE

The earlier options have higher priorities than the latter ones. Thus, the --include options should come before --exclude. Otherwise, all the --include options such as --include 'pic/' --include '4.png' which appear later than --exclude '*' will be ignored.

Filtering modes

Filtering modes determine how multiple filtering patterns decide whether to synchronize a path. The sync command supports two filtering modes: one-time filtering and layered filtering. By default, the sync command uses the layered filtering mode. You can use the --match-full-path parameter to switch to one-time filtering mode.

One-time filtering

In one-time filtering, the full path of an object is matched against multiple patterns in sequence.

For example, given the a1/b1/c1.txt object and the inclusion/exclusion rules --include a*.txt, --include c1.txt, --exclude c*.txt, the string a1/b1/c1.txt is matched against the three patterns --include a*.txt, --include c1.txt, and --exclude c*.txt in sequence.

The specific steps are as follows:

  1. a1/b1/c1.txt is matched against --include a*.txt, which fails to match.
  2. a1/b1/c1.txt is matched against --include c1.txt, which succeeds according to the matching rules. The final matching result for a1/b1/c1.txt is "sync."

The subsequent rule --exclude c*.txt would also match based on the suffix. However, because include/exclude parameters are evaluated sequentially, once a pattern matches, no further patterns are evaluated. Therefore, the final result for a1/b1/c1.txt is determined by the --include c1.txt rule, which is "sync."
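The first-match-wins behavior described above can be modeled in a few lines of shell. This is a simplified sketch, not the juicefs implementation: rules are written in a made-up notation (+PATTERN for --include, -PATTERN for --exclude), only patterns without / are handled, and they are compared against the last path component as an approximation of suffix matching.

```shell
#!/bin/sh
# Simplified model of one-time filtering (NOT the juicefs implementation).
# Rule notation (hypothetical): "+PATTERN" = --include, "-PATTERN" = --exclude.

# first_match NAME RULE...: print include, exclude, or none, according to
# the first rule whose pattern matches NAME.
first_match() {
  name=$1
  shift
  for rule in "$@"; do
    pattern=${rule#?}            # strip the leading + or -
    case $name in
      $pattern)                  # unquoted on purpose: used as a glob
        case $rule in
          +*) echo include ;;
          *)  echo exclude ;;
        esac
        return ;;
    esac
  done
  echo none
}

# one_time PATH RULE...: print sync or exclude for PATH.
one_time() {
  path=$1
  shift
  if [ "$(first_match "${path##*/}" "$@")" = exclude ]; then
    echo exclude
  else
    echo sync                    # an include matched, or nothing matched
  fi
}

one_time a1/b1/c1.txt '+a*.txt' '+c1.txt' '-c*.txt'   # prints: sync
one_time a1/b1/c1.txt '-c*.txt' '+c1.txt'             # prints: exclude
```

The second call shows why rule order matters: once the exclude rule is listed first, it wins before the include rule is ever evaluated.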

Here are some examples of exclude/include rules in the one-time filtering mode:

  • --exclude *.o excludes all files matching the pattern *.o.
  • --exclude /foo** excludes any file or directory in the root named foo.
  • --exclude **foo/** excludes all directories ending with foo.
  • --exclude /foo/*/bar excludes any bar file located two levels down within the foo directory in the root.
  • --exclude /foo/**/bar excludes any file named bar located within any level of the foo directory in the root. (** matches any number of directory levels)
  • Using --include */ --include *.c --exclude * includes only directories and c source files, excluding all other files and directories.
  • Using --include foo/bar.c --exclude * includes only the foo directory and foo/bar.c.

One-time filtering is easy to understand and use. It is recommended in most cases.

Layered filtering

Layered filtering breaks down the object's path into sequential subpaths according to its layers. For example, the layer sequence for the path a1/b1/c1.txt is a1, a1/b1, and a1/b1/c1.txt.

Each element in this sequence is treated as the original path in a one-time filtering process. During one-time filtering, if a pattern matches and it is an exclude rule, it immediately returns "Exclude" as the result for the entire layered filtering of the object. If the pattern matches and it is an include rule, it skips the remaining rules for the current layer and proceeds to the next layer of filtering.

If no rules match at a given level, the filtering proceeds to the next layer. If all layers are processed without any match, the default action "sync" is returned.

For example, given the object a1/b1/c1.txt and the rules --include a*.txt, --include c1.txt, --exclude c*.txt, the layered filtering process for this example with the subpath sequence a1, a1/b1, and a1/b1/c1.txt will be as follows:

  1. At the first layer, the path a1 is evaluated against the rules --include a*.txt, --include c1.txt, and --exclude c*.txt. None of these rules match, so it proceeds to the next layer.

  2. At the second layer, the path a1/b1 is evaluated against the same rules. Again, no rules match, so it proceeds to the next layer.

  3. At the third layer, the path a1/b1/c1.txt is evaluated. The rule --include c1.txt matches. The behavior of this pattern is "sync." In layered filtering, if the filtering result for a level is "sync," it directly proceeds to the next layer of filtering.

  4. There are no more layers. All layers of filtering have been completed, so the default behavior is "sync."

In the example above, the match was successful at the last layer. Besides this, there are two other scenarios:

  • If a match occurs before the last layer, and it is an exclude pattern, the result is "exclude" for the entire filtering. If it is an include pattern, it proceeds to the next layer of filtering.
  • If all layers are completed without any matches, the default behavior is "sync."

In summary, layered filtering sequentially applies one-time filtering from the top layer to the bottom. Each layer of filtering can result in either "exclude," which is final, or "sync," which requires proceeding to the next layer.
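The layer-by-layer process can be sketched the same way. This is again a simplified model rather than the juicefs implementation: rules use a made-up notation (+PATTERN for --include, -PATTERN for --exclude), only patterns without / are handled, paths are assumed to be relative, and each layer is represented by its last path component.

```shell
#!/bin/sh
# Simplified model of layered filtering (NOT the juicefs implementation).
# Rule notation (hypothetical): "+PATTERN" = --include, "-PATTERN" = --exclude.

# first_match NAME RULE...: print include, exclude, or none, according to
# the first rule whose pattern matches NAME.
first_match() {
  name=$1
  shift
  for rule in "$@"; do
    pattern=${rule#?}            # strip the leading + or -
    case $name in
      $pattern)                  # unquoted on purpose: used as a glob
        case $rule in
          +*) echo include ;;
          *)  echo exclude ;;
        esac
        return ;;
    esac
  done
  echo none
}

# layered PATH RULE...: walk PATH from the top layer down. An exclude at
# any layer is final; include or no match moves on to the next layer.
layered() {
  rest=$1
  shift
  while :; do
    layer=${rest%%/*}            # name of the current layer
    if [ "$(first_match "$layer" "$@")" = exclude ]; then
      echo exclude
      return
    fi
    case $rest in
      */*) rest=${rest#*/} ;;    # descend one layer
      *)   break ;;              # that was the last layer
    esac
  done
  echo sync
}

layered a1/b1/c1.txt '+a*.txt' '+c1.txt' '-c*.txt'   # prints: sync
layered a1/b1/c1.txt '+c1.txt' '-*'                   # prints: exclude
layered a1/b1/c1.txt '+a1' '+b1' '+c1.txt' '-*'       # prints: sync
```

The second call is excluded at the first layer because a1 already matches the catch-all exclude; the third call avoids this by including every parent layer explicitly.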

Here are some examples of layered filtering with exclude/include rules:

  • --exclude *.o excludes all files matching "*.o".

  • --exclude /foo excludes files or directories with the root directory name "foo" during transfer.

  • --exclude foo/ excludes all directories named "foo".

  • --exclude /foo/*/bar excludes "bar" files under the "foo" directory, up to two layers deep from the root directory.

  • --exclude /foo/**/bar excludes "bar" files under the "foo" directory recursively at any layer from the root directory. ("**" matches any number of directory levels)

  • Using --include */ --include *.c --exclude * includes all directories and c source code files, excluding all other files and directories.

  • Using --include foo/ --include foo/bar.c --exclude * includes only the "foo" directory and "foo/bar.c" file. (The "foo" directory must be explicitly included, or it is excluded by the --exclude * rule.)

  • For dir_name/***, it matches all files at all layers under the dir_name directory. Note that each subpath element is recursively traversed from top to bottom, so include/exclude matching rules apply recursively to each full path element. For example, to include /foo/bar/baz, both /foo and /foo/bar should not be excluded. When a file is found to be transferred, the exclusion matching pattern short-circuits the exclusion traversal at that file's directory layer. If a parent directory is excluded, deeper include pattern matching is ineffective. This is crucial when using trailing *. For example, the following example will not work as expected:

    --include='/some/path/this-file-will-not-be-found' 
    --include='/file-is-included'
    --exclude='*'

    Due to the * rule excluding the parent directory some, it will fail. One solution is to request the inclusion of all directory structures by using a single rule like --include */, which must be placed before other --include* rules. Another solution is to add specific inclusion rules for all parent directories that need to be accessed. For example, the following rules can work correctly:

    --include /some/
    --include /some/path/
    --include /some/path/this-file-is-found
    --include /file-also-included
    --exclude *

Understanding and using layered filtering can be quite complex. However, it is mostly compatible with rsync's include/exclude parameters. Therefore, it is generally recommended to use it in scenarios compatible with rsync behavior.

Directory Structure and File Permissions

The subcommand sync only synchronizes file objects and directories containing file objects, and skips empty directories by default. To synchronize empty directories, use the --dirs option.

In addition, when synchronizing between file systems such as local, SFTP and HDFS, option --perms can be used to synchronize file permissions from the source to the destination.

You can use --links option to disable symbolic link resolving when synchronizing local directories. That is, synchronizing only the symbolic links themselves rather than the directories or files they are pointing to. The new symbolic links created by the synchronization refer to the same paths as the original symbolic links without any conversions, no matter whether their references are reachable before or after the synchronization.

Note the following details:

  1. The mtime of a symbolic link will not be synchronized;
  2. --check-new and --perms will be ignored when synchronizing symbolic links.

Concurrent data synchronization

juicefs sync starts 10 threads by default to run syncing jobs. You can set the --threads option to increase or decrease the number of threads as needed. Note, however, that blindly increasing --threads may not always help, and you should also consider the following:

  • SRC and DST storage systems may have already reached their bandwidth limits. If this is indeed the bottleneck, further increasing concurrency will not improve the situation;
  • Performing juicefs sync on a single host may be limited by host resources, e.g. CPU or network throttling. If this is the case, consider using distributed synchronization (introduced below);
  • If the synchronized data is mainly small files and the list API of the SRC storage system has excellent performance, the default single-threaded list of juicefs sync may become a bottleneck. Consider enabling concurrent list (introduced below).

Concurrent list

In the output of juicefs sync, pay attention to the Pending objects count. If this value stays at zero, consumption is faster than production, and you should increase --list-threads to enable concurrent list, then use --list-depth to control the list depth.

For example, if you're dealing with an object storage bucket used by JuiceFS, the directory structure will be /<vol-name>/chunks/xxx/xxx/.... Using --list-depth=2 will perform concurrent listing on /<vol-name>/chunks, which usually yields the best performance.
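The three options can be combined in a single invocation. The wrapper below is a sketch: sync_tuned is a hypothetical helper, the thread counts and paths are placeholder values rather than recommendations, and by default it only prints the command so that it can be reviewed before running (set RUN=1 to execute it).

```shell
#!/bin/sh
# sync_tuned THREADS LIST_THREADS LIST_DEPTH SRC DST
# Builds a juicefs sync command line with the concurrency options
# discussed above. Prints it by default; executes it when RUN=1.
sync_tuned() {
  set -- juicefs sync --threads "$1" --list-threads "$2" --list-depth "$3" "$4" "$5"
  if [ "${RUN:-0}" = 1 ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Placeholder values and paths, for illustration only.
sync_tuned 50 8 2 /data/src/ /data/dst/
```

Tune one option at a time and watch the Pending objects count and throughput between runs, since the effective limit may be storage bandwidth rather than concurrency.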

Distributed synchronization

Synchronizing between two object storages is essentially pulling data from one and pushing it to the other. The efficiency of the synchronization will depend on the bandwidth between the client and the cloud.

[Figure: JuiceFS-sync-single]

When copying large scale data, node bandwidth can easily bottleneck the synchronization process. For this scenario, juicefs sync provides a multi-machine concurrent solution, as shown in the figure below.

[Figure: JuiceFS-sync-worker]

The manager node executes the sync command as the master and defines multiple worker nodes through the --worker option (the manager node itself also serves as a worker node). JuiceFS splits the workload and distributes it to the workers for distributed synchronization. This increases the amount of data that can be processed per unit time, and the total bandwidth is multiplied as well.

When using distributed syncing, configure SSH logins so that the manager can access all worker nodes without a password. If the SSH port isn't the default 22, you'll also have to specify it in the manager's ~/.ssh/config. The manager distributes the JuiceFS Client to all worker nodes, so they should all use the same architecture to avoid compatibility problems.
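A worker entry for the manager's ~/.ssh/config might be generated as follows. This is a sketch with made-up values: the host alias worker1, the address 192.0.2.10 (from the documentation-only range), the user syncuser, and port 2222 are all assumptions, and ssh_config_entry is a hypothetical helper.

```shell
#!/bin/sh
# ssh_config_entry ALIAS HOSTNAME USER PORT
# Prints an OpenSSH client configuration stanza for one worker node.
ssh_config_entry() {
  printf 'Host %s\n    HostName %s\n    User %s\n    Port %s\n' "$1" "$2" "$3" "$4"
}

# Print the stanza; append it to the config file only after reviewing it:
#   ssh_config_entry worker1 192.0.2.10 syncuser 2222 >> ~/.ssh/config
ssh_config_entry worker1 192.0.2.10 syncuser 2222

# Passwordless login is typically set up by installing the manager's
# public key on each worker, for example:
#   ssh-copy-id -p 2222 syncuser@192.0.2.10
```

After this, running ssh worker1 true on the manager should succeed without a password prompt before the worker is passed to --worker.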

For example, synchronize data from Object Storage A to Object Storage B concurrently with multiple machines.

juicefs sync --worker [email protected],[email protected] s3://ABCDEFG:[email protected] oss://ABCDEFG:[email protected]

The synchronization workload between the two object storages is shared by the current machine and the two Workers [email protected] and [email protected].

Application Scenarios

Geo-disaster Recovery Backup

Geo-disaster recovery backs up the files themselves, so the files stored in JuiceFS should be synchronized to another object storage. For example, to synchronize files in JuiceFS File System to Object Storage A:

# mount JuiceFS
juicefs mount -d redis://10.10.0.8:6379/1 /mnt/jfs
# synchronization
juicefs sync /mnt/jfs/ s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com/

After sync, you can see all the files in Object Storage A.

Build a JuiceFS Data Copy

Unlike the file-oriented disaster recovery backup, the purpose of creating a copy of JuiceFS data is to establish a mirror with exactly the same content and structure as the JuiceFS data storage. When the object storage in use fails, you can switch to the data copy by modifying the configurations. Note that only the file data of the JuiceFS file system is replicated, and the metadata stored in the metadata engine still needs to be backed up.

This requires manipulating the underlying object storage directly to synchronize it with the target object storage. For example, to take the Object Storage B as the data copy of the JuiceFS File System:

juicefs sync cos://ABCDEFG:HIJKLMN@ccc-125000.cos.ap-beijing.myqcloud.com oss://ABCDEFG:HIJKLMN@bbb.oss-cn-hangzhou.aliyuncs.com

After sync, the file content and hierarchy in the Object Storage B are exactly the same as the underlying object storage of JuiceFS.

Sync across regions using S3 Gateway

When transferring a large number of small files across different regions via FUSE mount points, clients will inevitably talk to the metadata service in the opposite region via the public internet (or a dedicated network connection with limited bandwidth). In such cases, metadata latency can become the bottleneck of the data transfer:

[Figure: sync via public metadata service]

S3 Gateway comes to the rescue in these circumstances: deploy a gateway in the source region. Since this gateway accesses metadata via the private network, metadata latency is reduced to a minimum, bringing the best performance for small-file-intensive scenarios.

[Figure: sync via gateway]

Read S3 Gateway to learn its deployment and use.