How do you ensure that the data copied from HDFS to JuiceFS is correct?

When migrating data from HDFS to JuiceFS, you typically copy it with DistCp, which can use checksums to verify that the copied data is correct.

DistCp calls the getFileChecksum() interface of HDFS to get the checksum of the source file, and then compares it with the checksum of the copied file to make sure the data is the same.

By default, Hadoop uses the MD5-MD5-CRC32 checksum algorithm, which relies heavily on HDFS implementation details. It follows the file's current block layout: for each data block, it summarizes the block with MD5-CRC32 (an MD5 computed over the CRC32 checksums of the block's 64K chunks), and then computes an MD5 over those per-block digests to produce the file checksum. Because the result depends on where the block boundaries fall, this algorithm cannot be used to compare files between HDFS clusters with different block sizes.
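
The layering described above can be illustrated with a minimal Python sketch. It is a simplified model, not the Hadoop or JuiceFS implementation, and it is not bit-compatible with real HDFS checksums; the function name `md5_md5_crc32` and its parameters are hypothetical. It only shows why the same bytes produce different checksums under different block sizes.

```python
import hashlib
import zlib


def md5_md5_crc32(data: bytes, block_size: int, chunk_size: int = 64 * 1024) -> str:
    """Simplified MD5-of-MD5-of-CRC32 (illustrative only, not HDFS-compatible)."""
    block_digests = []
    for block_start in range(0, len(data), block_size):
        block = data[block_start:block_start + block_size]
        # CRC32 of every chunk in this block, concatenated as 4-byte values.
        crcs = b"".join(
            zlib.crc32(block[i:i + chunk_size]).to_bytes(4, "big")
            for i in range(0, len(block), chunk_size)
        )
        # One MD5 per block, computed over that block's chunk CRCs.
        block_digests.append(hashlib.md5(crcs).digest())
    # Final MD5 over the per-block digests.
    return hashlib.md5(b"".join(block_digests)).hexdigest()


if __name__ == "__main__":
    data = bytes(range(256)) * 1024  # 256 KiB of sample data
    same = md5_md5_crc32(data, 128 * 1024) == md5_md5_crc32(data, 128 * 1024)
    differs = md5_md5_crc32(data, 128 * 1024) != md5_md5_crc32(data, 256 * 1024)
    print(same, differs)
```

With identical data, the two calls that share a block size agree, while changing only the block size changes the number of per-block digests fed into the final MD5 and therefore the result.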

To be compatible with HDFS, JuiceFS also implements the MD5-MD5-CRC32 algorithm: it reads the file data back and applies the same algorithm to calculate a checksum for comparison.

Because JuiceFS is built on object storage, which already guarantees data integrity through a variety of checksum mechanisms, JuiceFS does not enable this checksum algorithm by default. To enable it, set the juicefs.file.checksum parameter.

Because the algorithm only produces matching results when both sides use the same block size, you also need to set the juicefs.block.size parameter to the block size used by HDFS (the dfs.blocksize parameter, which defaults to 128 MB).

In addition, HDFS supports a different block size for each file, while JuiceFS does not. With checksums enabled, copying files whose block size differs from the configured one would fail. The JuiceFS SDK includes a hotfix for DistCp (which requires tools.jar) that skips these files, leaving them uncompared rather than throwing an exception.
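
The skip behavior can be pictured with a small Python sketch. This is only an illustration of the idea, not the hotfix's actual code; the function name `split_by_block_size` and the sample paths are hypothetical.

```python
from typing import Iterable, List, Tuple


def split_by_block_size(
    files: Iterable[Tuple[str, int]], target_block_size: int
) -> Tuple[List[str], List[str]]:
    """Separate files whose checksum can be compared from those to skip.

    `files` yields (path, source_block_size) pairs; a file is comparable
    only if its block size equals the block size configured on the target.
    """
    comparable: List[str] = []
    skipped: List[str] = []
    for path, block_size in files:
        if block_size == target_block_size:
            comparable.append(path)
        else:
            # Different block size: skip the comparison instead of failing.
            skipped.append(path)
    return comparable, skipped


if __name__ == "__main__":
    mb = 1024 * 1024
    sample = [("/data/a", 128 * mb), ("/data/b", 64 * mb), ("/data/c", 128 * mb)]
    print(split_by_block_size(sample, 128 * mb))
```

Files that end up in the skipped list are copied but not checksum-verified, so they may deserve a separate check if full verification matters.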

When you use DistCp for incremental copying, it decides by default whether a file's contents are unchanged based on the file's modification time and size, which is sufficient for HDFS (because HDFS does not allow you to change file contents). If you want further assurance that the data on both sides is the same, you can enable checksum comparison with the -update parameter.
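
As a rough sketch of how the parameters above might be combined, the following Python helper assembles a DistCp command line. Several details are assumptions rather than statements from this page: that juicefs.file.checksum takes the value `true`, that juicefs.block.size is given in bytes, that both are passed as Hadoop `-D` generic options, and the example URIs. The helper name `build_distcp_command` is hypothetical. Check the JuiceFS Hadoop SDK configuration reference before relying on it.

```python
from typing import List


def build_distcp_command(
    src: str,
    dst: str,
    block_size: int = 128 * 1024 * 1024,  # assumed to match dfs.blocksize, in bytes
    update: bool = True,
) -> List[str]:
    """Build a `hadoop distcp` argument list with checksum comparison enabled."""
    cmd = [
        "hadoop",
        "distcp",
        # Assumption: the checksum switch is a boolean property.
        "-Djuicefs.file.checksum=true",
        # Assumption: the block size is expressed in bytes.
        f"-Djuicefs.block.size={block_size}",
    ]
    if update:
        cmd.append("-update")
    cmd += [src, dst]
    return cmd


if __name__ == "__main__":
    # Hypothetical source and destination URIs.
    print(" ".join(build_distcp_command("hdfs://namenode:8020/data", "jfs://myjfs/data")))
```

Returning a list rather than a single string makes it straightforward to hand the command to `subprocess.run` without shell quoting issues.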