Data migration
juicefs sync is a powerful data migration tool that copies data between any of the supported storage systems, including object storage, JuiceFS itself, and local file systems. It also supports remote directories over SSH, HDFS, WebDAV, etc., and provides advanced features such as incremental synchronization, pattern matching (like rsync), and distributed syncing.
For data migrations that involve JuiceFS, it's recommended to use the jfs:// protocol rather than mounting JuiceFS and accessing it through a local directory. The jfs:// protocol bypasses the FUSE mount point and accesses JuiceFS directly. This still requires the client configuration file, which you should prepare in advance using juicefs auth. In large-scale scenarios, bypassing FUSE saves resources and improves performance.
juicefs sync works like this:
juicefs sync [command options] SRC DST
# Sync objects from OSS to S3
juicefs sync oss://mybucket.oss-cn-shanghai.aliyuncs.com s3://mybucket.s3.us-east-2.amazonaws.com
# Sync objects from S3 to JuiceFS
juicefs sync s3://mybucket.s3.us-east-2.amazonaws.com/ jfs://VOL_NAME/
# SRC: a1/b1,a2/b2,aaa/b1 DST: empty sync result: aaa/b1
juicefs sync --exclude='a?/b*' s3://mybucket.s3.us-east-2.amazonaws.com/ jfs://VOL_NAME/
# SRC: a1/b1,a2/b2,aaa/b1 DST: empty sync result: a1/b1,aaa/b1
juicefs sync --include='a1/b1' --exclude='a[1-9]/b*' s3://mybucket.s3.us-east-2.amazonaws.com/ jfs://VOL_NAME/
# SRC: a1/b1,a2/b2,aaa/b1,b1,b2 DST: empty sync result: a1/b1,b2
juicefs sync --include='a1/b1' --exclude='a*' --include='b2' --exclude='b?' s3://mybucket.s3.us-east-2.amazonaws.com/ jfs://VOL_NAME/
Trailing slash
SRC and DST ending with a trailing / are treated as directories, e.g. movie/. Those without a trailing / are treated as prefixes and will be used for matching. Assuming the current directory contains the directories test and text, the following command synchronizes them into the destination ~/mnt/:
juicefs sync ./te ~/mnt/te
In this way, the sync subcommand takes te as a prefix to find all the matching directories, i.e. test and text. ~/mnt/te is also a prefix, and all directories and files synchronized to this destination are renamed by replacing the original prefix te with the new prefix te. Because both prefixes are identical in this example, the names do not visibly change. However, if we use a different destination prefix, for example ab:
juicefs sync ./te ~/mnt/ab
The test directory synchronized to the destination will be renamed to abst, and text will become abxt.
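The renaming rule can be illustrated with plain shell parameter expansion. This is only a sketch of the rule described above, not how juicefs sync is implemented; the helper name renamed is hypothetical.

```shell
# Toy illustration of prefix replacement; "renamed" is a hypothetical helper,
# not part of juicefs.
# usage: renamed SRC_PREFIX DST_PREFIX NAME
renamed() {
  src_prefix=$1 dst_prefix=$2 name=$3
  # Strip the source prefix from the name, then prepend the destination prefix.
  printf '%s\n' "${dst_prefix}${name#"$src_prefix"}"
}

renamed te ab test   # prints: abst
renamed te ab text   # prints: abxt
renamed te te test   # prints: test (same prefix, name unchanged)
```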
Incremental and full synchronization
By default, juicefs sync performs incremental synchronization: files that already exist at the destination are overwritten only when their sizes don't match the source. On top of this, you can use --update to also copy files whose source mtime is newer.
For full synchronization (copy everything, whether or not it already exists at the destination), use --force-update.
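The three modes can be summarized as a small decision function. The sketch below only models the rules as stated in this section; should_copy and its arguments are hypothetical names, and the real client may consider additional details.

```shell
# Toy model of the copy decision; "should_copy" is a hypothetical helper.
# usage: should_copy MODE DST_EXISTS SRC_SIZE DST_SIZE SRC_MTIME DST_MTIME
#   MODE is one of: default, update, force-update
#   DST_EXISTS is yes or no; mtimes are integer timestamps
should_copy() {
  mode=$1 exists=$2 ssize=$3 dsize=$4 smtime=$5 dmtime=$6
  # --force-update: copy everything regardless of the destination.
  if [ "$mode" = force-update ]; then echo copy; return; fi
  # Missing at the destination: always copy.
  if [ "$exists" = no ]; then echo copy; return; fi
  # Default incremental rule: overwrite when sizes differ.
  if [ "$ssize" -ne "$dsize" ]; then echo copy; return; fi
  # --update: additionally copy when the source mtime is newer.
  if [ "$mode" = update ] && [ "$smtime" -gt "$dmtime" ]; then echo copy; return; fi
  echo skip
}

should_copy default      yes 100 100 200 100   # prints: skip
should_copy update       yes 100 100 200 100   # prints: copy
should_copy force-update yes 100 100 100 100   # prints: copy
```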
Pattern matching
Similar to rsync, you can use --exclude and --include patterns to build filters:
- Patterns ending with / only match directories; otherwise, they match files, links or devices.
- Patterns containing *, ? or [ match as wildcards; otherwise, they match as regular strings.
- * matches any non-empty path component (it stops at /).
- ? matches any single character except /.
- [ matches a set of characters, for example [a-z] or [[:alpha:]].
- Backslashes can be used to escape characters in wildcard patterns, while they match literally when no wildcards are present.
- Patterns are always matched recursively, using them as prefixes.
- Earlier options have higher priority than later ones. Thus, the --include options should come before --exclude. Otherwise, all --include options, such as --include 'pic/' --include '4.png', that appear after --exclude '*' will be ignored.
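The ordering rule in the last item can be modeled with a few lines of shell. This is a deliberately simplified sketch: it handles flat file names only, uses the shell's own glob matching rather than the actual juicefs sync matcher, and the decide helper with its +/- rule notation is hypothetical.

```shell
# Toy model of first-match-wins filtering; NOT the real juicefs sync matcher.
# Rules are passed in command-line order, each prefixed with "+" (include)
# or "-" (exclude). Flat names only, no directory handling.
# usage: decide NAME RULE...
decide() {
  name=$1; shift
  for rule in "$@"; do
    pat=${rule#?}            # strip the leading + or -
    case $name in
      $pat)                  # unquoted, so $pat is treated as a glob pattern
        case $rule in
          +*) echo include ;;
          *)  echo exclude ;;
        esac
        return ;;
    esac
  done
  echo include               # no rule matched: the file is synchronized
}

decide 4.png     '+4.png' '-*'   # prints: include
decide 4.png     '-*' '+4.png'   # prints: exclude (the later include is ignored)
decide notes.txt '+4.png' '-*'   # prints: exclude
```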
To sync everything but exclude hidden directories and files (names starting with . are regarded as hidden):
# Excluding hidden directories and files
juicefs sync --exclude '.*' /tmp/dir/ s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com/
You can use this option several times with different parameters to exclude multiple patterns. For example, the following command excludes all hidden files, the pic/ directory, and 4.png:
juicefs sync --exclude '.*' --exclude 'pic/' --exclude '4.png' /tmp/dir/ s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com
The --include option can be used to include patterns you don't want to exclude. For example, to synchronize only pic/ and 4.png and exclude everything else:
juicefs sync --include 'pic/' --include '4.png' --exclude '*' /tmp/dir/ s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com
Multi-threading and bandwidth throttling
juicefs sync uses 10 threads by default; adjust this using the --thread option.
In addition, you can set the --bwlimit option, in Mbps, to limit the bandwidth used by the synchronization. The default value is 0, meaning that bandwidth is not limited.
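When choosing a limit, a quick estimate of the minimum transfer time can help. The sketch below assumes Mbps means megabits per second and ignores protocol overhead and request latency; min_seconds is a hypothetical helper, and the result is rounded down by integer division.

```shell
# Rough lower bound on transfer time under a bandwidth limit, assuming Mbps
# means megabits per second; "min_seconds" is a hypothetical helper.
# usage: min_seconds SIZE_MB LIMIT_MBPS   (SIZE_MB in megabytes)
min_seconds() {
  size_mb=$1 limit_mbps=$2
  # 1 megabyte = 8 megabits, so seconds = megabits / (megabits per second).
  echo $(( size_mb * 8 / limit_mbps ))
}

min_seconds 10000 800   # prints: 100
```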
Directory structure and file permissions
juicefs sync skips empty directories by default. To synchronize empty directories, use the --dirs option.
In addition, when synchronizing between file systems such as local, SFTP and HDFS, the --perms option can be used to synchronize file permissions.
Copy symbolic links
You can use the --links option to disable symbolic link resolution when synchronizing local directories. That is, only the symbolic links themselves are synchronized, rather than the directories or files they point to. The new symbolic links created by the synchronization refer to the same paths as the original symbolic links without any conversion, regardless of whether their targets are reachable before or after the synchronization. Also note that:
- The mtime of a symbolic link will not be synchronized;
- The --check-new and --perms options will be ignored when synchronizing symbolic links.
Distributed data synchronization
Synchronizing between two object storages is essentially pulling data from one and pushing it to the other. The efficiency of the synchronization will depend on the bandwidth between the client and the cloud.
When copying large-scale data, node bandwidth can easily bottleneck the synchronization process. For this scenario, juicefs sync provides a multi-machine concurrent solution, as shown in the figure below.
The manager node executes the sync command as the master and defines multiple worker nodes by setting the --worker option (the manager node itself also serves as a worker node). JuiceFS splits the workload and distributes it to the workers for distributed synchronization. This increases the amount of data that can be processed per unit time, and the total bandwidth is also multiplied.
When using distributed syncing, you should configure SSH logins so that the manager can access all worker nodes without a password. If the SSH port isn't the default 22, you'll also have to specify it in the manager's ~/.ssh/config. The manager distributes the JuiceFS Client to all worker nodes, so they should all use the same architecture to avoid compatibility problems.
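For a worker that listens on a non-default SSH port, the manager's ~/.ssh/config needs a matching entry. The sketch below prints such a stanza using standard OpenSSH keywords; ssh_config_entry is a hypothetical helper and port 2222 is only an example value.

```shell
# Print an ~/.ssh/config stanza for one worker; "ssh_config_entry" is a
# hypothetical helper and port 2222 is an example value.
# usage: ssh_config_entry HOST USER PORT
ssh_config_entry() {
  printf 'Host %s\n    User %s\n    Port %s\n' "$1" "$2" "$3"
}

# Review the output, then append it to ~/.ssh/config on the manager.
ssh_config_entry 192.168.1.20 bob 2222
```

Passwordless login itself is typically set up with key-based authentication, for example by generating a key with ssh-keygen and installing it on each worker with ssh-copy-id.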
For example, to synchronize data between two object storage services:
juicefs sync --worker bob@192.168.1.20,tom@192.168.8.10 s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com oss://ABCDEFG:HIJKLMN@bbb.oss-cn-hangzhou.aliyuncs.com
The synchronization workload between the two object storages is shared by the manager machine and the two workers, bob@192.168.1.20 and tom@192.168.8.10.