Migrate data to JuiceFS

Hive, Spark, Impala

Data in a data warehouse such as Hive can be stored on different storage systems. You can copy the data from HDFS to JuiceFS with distcp and then update the data location of the table with alter table.
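As a rough sketch of the two steps above, the script below prints a distcp command followed by a Hive alter table statement. The paths, the table name your_db.your_table, the helper names run and migrate_table, and the DRY_RUN switch are all assumptions made for illustration. By default nothing is executed; setting DRY_RUN=0 would run the commands for real.

```shell
#!/bin/sh
# Sketch only: every name and path below is a placeholder, not a real cluster value.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# migrate_table SRC_DIR DST_DIR TABLE
# Copy the table directory to JuiceFS, then point the table at the new location.
migrate_table() {
    src="$1"; dst="$2"; table="$3"
    run hadoop distcp "$src" "$dst" || return 1
    run hive -e "ALTER TABLE $table SET LOCATION '$dst'"
}

migrate_table hdfs://your_hdfs/warehouse/your_db.db/your_table \
              jfs://your_jfs/warehouse/your_db.db/your_table \
              your_db.your_table
```

The location is only updated after the copy succeeds, so a failed distcp leaves the table pointing at the original data. For a partitioned table, existing partitions keep their own locations and have to be updated separately.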

Different partitions of the same table can also be stored on different storage systems. You can change the program to write new partitions to JuiceFS, copy the existing partitions to JuiceFS via distcp, and finally update the partition locations.
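The per-partition variant could look like the sketch below, which assumes a hypothetical table your_table partitioned by a single column named dt and laid out as dt=VALUE directories. The helper names run and migrate_partition and the DRY_RUN switch are illustrative; by default the commands are only printed.

```shell
#!/bin/sh
# Sketch only: the table, the partition column `dt`, and all paths are placeholders.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# migrate_partition SRC_TABLE_DIR DST_TABLE_DIR TABLE PARTITION_VALUE
# Copy one existing partition to JuiceFS, then update that partition's location.
migrate_partition() {
    src="$1/dt=$4"; dst="$2/dt=$4"
    run hadoop distcp "$src" "$dst" || return 1
    run hive -e "ALTER TABLE $3 PARTITION (dt='$4') SET LOCATION '$dst'"
}

# Migrate two example partitions, stopping at the first failure.
for day in 2020-01-01 2020-01-02; do
    migrate_partition hdfs://your_hdfs/warehouse/your_table \
                      jfs://your_jfs/warehouse/your_table \
                      your_table "$day" || break
done
```

Migrating one partition at a time keeps each copy small and lets the table stay readable throughout, since partitions that have not been moved yet still resolve to their original location.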

HBase

Preparation

Assume you have a new HBase cluster with the same configuration as the original cluster.

Offline Migration

Disable all tables in the Original HBase Cluster:

bin/disable_all_tables.sh

In the New JuiceFS HBase Cluster:

  • Copy data via distcp:

    sudo -u hbase hadoop distcp -Dmapreduce.map.memory.mb=1500 your_filesystem/hbase/data jfs://your_jfs/hbase/data
  • Remove the Metadata of HBase:

    sudo -u hbase hadoop fs -rm -r jfs://your_jfs/hbase/data/hbase
  • Repair the Metadata:

    sudo -u hbase hbase hbck -fixMeta -fixAssignments

Online Migration

Online migration imports the historical data into the new JuiceFS HBase cluster while keeping it in sync with the original HBase cluster. Once the data is synchronized, the business can switch to the new JuiceFS HBase cluster.

In the Original HBase Cluster:

  • Turn on the replication:

    add_peer '1', CLUSTER_KEY => "jfs_zookeeper:2181:/hbase",TABLE_CFS => { "your_table" => []}
    enable_table_replication 'your_table'

    Tables will be created automatically (including splits).

  • Turn off the replication:

    disable_peer("1")
  • Create a snapshot:

    snapshot 'your_table','your_snapshot'

In the New JuiceFS HBase Cluster:

  • Import the snapshot:

    sudo -u hbase hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot your_snapshot \
    -copy-from s3://your_s3/hbase \
    -copy-to jfs://your_jfs/hbase \
    -mappers 1 \
    -bandwidth 20

    You can adjust the number of mappers and the bandwidth (in MB/s).

  • Restore the snapshot:

    disable 'your_table'
    restore_snapshot 'your_snapshot'
    enable 'your_table'
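To make the mapper count and bandwidth easier to tune, the import step could be wrapped in a small function, as in the sketch below. The function name export_snapshot, the helper run, and the DRY_RUN switch are assumptions for illustration; the ExportSnapshot options are the ones used in the import step, with 1 mapper and 20 MB/s as defaults. By default the command is only printed.

```shell
#!/bin/sh
# Sketch only: snapshot name and storage paths are the placeholders used in this guide.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# export_snapshot SNAPSHOT [MAPPERS] [BANDWIDTH_MB_PER_SEC]
# Defaults match the values shown in the import step (1 mapper, 20 MB/s).
export_snapshot() {
    snapshot="$1"
    mappers="${2:-1}"
    bandwidth="${3:-20}"
    run sudo -u hbase hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
        -snapshot "$snapshot" \
        -copy-from s3://your_s3/hbase \
        -copy-to jfs://your_jfs/hbase \
        -mappers "$mappers" \
        -bandwidth "$bandwidth"
}

# Example: 4 mappers with a bandwidth limit of 50
export_snapshot your_snapshot 4 50
```

Starting with low values and raising them gradually is one way to limit the load that the copy places on the original cluster while it is still serving traffic.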

Finally, turn the replication back on in the Original HBase Cluster:

enable_peer("1")