# Migrate data to JuiceFS

## Hive, Spark, Impala
Data in a data warehouse such as Hive can be stored on different storage systems. You can copy data from HDFS to JuiceFS with `distcp`, then update the table location with `ALTER TABLE`.

Different partitions of the same table can also be stored on different storage systems. You can change your jobs to write new partitions to JuiceFS, copy the existing partitions to JuiceFS with `distcp`, and finally update the partition locations.
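The steps above can be sketched as follows. The namenode address, warehouse paths, table name, and partition key are all placeholders; substitute your own values:

```shell
# Copy the table directory from HDFS to JuiceFS (paths are placeholders).
hadoop distcp \
  hdfs://your_namenode/user/hive/warehouse/your_table \
  jfs://your_jfs/user/hive/warehouse/your_table

# Point the whole table at the new location...
hive -e "ALTER TABLE your_table SET LOCATION 'jfs://your_jfs/user/hive/warehouse/your_table';"

# ...or move a single partition (hypothetical partition key 'dt').
hive -e "ALTER TABLE your_table PARTITION (dt='2021-01-01')
         SET LOCATION 'jfs://your_jfs/user/hive/warehouse/your_table/dt=2021-01-01';"
```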
## HBase

### Preparation
Assume you have a new HBase cluster with the same configuration.
### Offline Migration
Disable all tables in the **Original HBase Cluster**:

```shell
bin/disable_all_tables.sh
```
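The contents of `disable_all_tables.sh` are not shown in this guide and may vary by distribution. If the script is not available, a minimal sketch with the same effect, using the HBase shell's built-in `disable_all` command, could be:

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the bundled script:
# disable every table in the cluster via the HBase shell.
# 'disable_all' matches tables by regex and asks for confirmation,
# so pipe in the answer "y".
echo -e "disable_all '.*'\ny" | hbase shell -n
```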
In the **New JuiceFS HBase Cluster**:

1. Copy data via `distcp`:

   ```shell
   sudo -u hbase hadoop distcp -Dmapreduce.map.memory.mb=1500 your_filesystem/hbase/data jfs://your_jfs/hbase/data
   ```

2. Remove the HBase metadata:

   ```shell
   sudo -u hbase hadoop fs -rm -r jfs://your_jfs/hbase/data/hbase
   ```

3. Repair the metadata:

   ```shell
   sudo -u hbase hbase hbck -fixMeta -fixAssignments
   ```
### Online Migration
The next step is to import the historical data into the new JuiceFS HBase cluster while keeping it in sync with the original HBase cluster. Once the data is synchronized, the business can switch over to the new JuiceFS HBase cluster.
In the **Original HBase Cluster**:

1. Turn on replication:

   ```shell
   add_peer '1', CLUSTER_KEY => "jfs_zookeeper:2181:/hbase", TABLE_CFS => { "your_table" => [] }
   enable_table_replication 'your_table'
   ```

   Tables will be created automatically (including splits).

2. Turn off replication:

   ```shell
   disable_peer("1")
   ```

3. Create a snapshot:

   ```shell
   snapshot 'your_table', 'your_snapshot'
   ```
In the **New JuiceFS HBase Cluster**:

1. Import the snapshot:

   ```shell
   sudo -u hbase hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
     -snapshot your_snapshot \
     -copy-from s3://your_s3/hbase \
     -copy-to jfs://your_jfs/hbase \
     -mappers 1 \
     -bandwidth 20
   ```

   You can adjust the number of mappers and the bandwidth (in MB/s) as needed.

2. Restore the snapshot:

   ```shell
   disable 'your_table'
   restore_snapshot 'your_snapshot'
   enable 'your_table'
   ```
Turn on the replication again in the **Original HBase Cluster**:

```shell
enable_peer("1")
```
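Before switching the business over, it is worth checking that both clusters hold the same data. One simple way, sketched here with a placeholder table name, is to run the bundled `RowCounter` MapReduce job on each cluster and compare the resulting row counts:

```shell
# Run on both clusters and compare the ROWS counter in the job output.
sudo -u hbase hbase org.apache.hadoop.hbase.mapreduce.RowCounter your_table
```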