Hadoop and Spark's FileSystem API is pluggable and supports many different storage systems. JuiceFS can be used out of the box through LocalFS: simply mount it on all nodes of the Hadoop cluster.
Many Hadoop clusters do not have public network IPs configured by default (such as UHadoop on UCloud), while JuiceFS needs public network access to reach its metadata service. One solution is to assign an elastic IP to every Hadoop node; another is to add a router with external network access to the Hadoop cluster, configured as follows:
Configure IP packet forwarding on a node with an external IP:
echo 1 > /proc/sys/net/ipv4/ip_forward
echo 1 > /proc/sys/net/ipv4/conf/all/forwarding
echo 1 > /proc/sys/net/ipv4/conf/default/forwarding
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
iptables -A FORWARD -i eth0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -i eth0 -o eth0 -j ACCEPT
(It is recommended to add the above settings to /etc/sysctl.conf and save the iptables rules with iptables-save, so that they remain in effect after the node is restarted.)
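For example, the forwarding settings could be persisted with lines like these in /etc/sysctl.conf (a sketch; adjust to your distribution):

```
# /etc/sysctl.conf — keep IP forwarding enabled across reboots
net.ipv4.ip_forward = 1
net.ipv4.conf.all.forwarding = 1
net.ipv4.conf.default.forwarding = 1
```

The iptables rules can be saved with iptables-save; where the saved rules live (e.g. /etc/iptables/rules.v4 or /etc/sysconfig/iptables) depends on the distribution.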
Use the above node as the router to the external network for the other nodes in the Hadoop cluster (assuming the internal network segment is 10.0.0.0/8, 10.1.0.1 is the default gateway, and 10.1.100.100 is the internal IP of the router):
ip route add 10.0.0.0/8 via 10.1.0.1
route del default gw 10.1.0.1
route add default gw 10.1.100.100
JuiceFS needs to be mounted at the same mount point on every node; /jfs is recommended, and the rest of this tutorial assumes that JuiceFS is mounted at /jfs:
chmod +x juicefs
sudo ./juicefs auth NAME --token TOKEN --accesskey ACCESS_KEY --secretkey SECRET_KEY
sudo ./juicefs mount NAME /jfs
To simplify administration, you can add the above commands to the cluster nodes' initialization script.
Accessing JuiceFS in Spark
Once JuiceFS is mounted, reading and writing data in JuiceFS from Spark is straightforward: just point the path at the JuiceFS mount point, for example:
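A sketch using an explicit file:// path under the mount point (the path /jfs/a/range is illustrative):

```
Spark > spark.range(1, 1000).write.parquet("file:///jfs/a/range")
Spark > spark.read.parquet("file:///jfs/a/range")
```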
You can either change the Hadoop configuration file core-site.xml to use LocalFS as the default file system, or add the following to Spark's configuration file:
spark.hadoop.fs.defaultFS = "file:///jfs/"
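The core-site.xml change might look like the following fragment (a sketch, using file:/// as the value so that absolute paths such as /jfs/a/range resolve through LocalFS):

```
<property>
  <name>fs.defaultFS</name>
  <value>file:///</value>
</property>
```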
Or add this to the front of a Spark script:
Spark > spark.sparkContext.hadoopConfiguration.set("fs.defaultFS", "file:///")
So the above can be abbreviated to:
Spark > spark.range(1, 1000).write.parquet("/jfs/a/range")
Spark > spark.read.parquet("/jfs/a/range")
Other components in Hadoop access JuiceFS in a very similar way.
Using JuiceFS in Hive
Hive also supports JuiceFS. When JuiceFS is not the default storage for Hive, create the table as an external table with its storage location in JuiceFS, for example:
$ mkdir /jfs/hive/test_json
Hive > CREATE EXTERNAL TABLE test_json (id BIGINT, text STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' STORED AS TEXTFILE LOCATION 'file:///jfs/hive/test_json';
Hive > LOAD DATA INPATH "file:///jfs/hive/test.json" OVERWRITE INTO TABLE test_json;
Hive > select * from test_json;
Hive extensions can also be placed in JuiceFS, such as:
Hive > ADD JAR file:///jfs/hive/lib/hive-json-serde.jar;
If JuiceFS is set as the default storage for Hive, no changes are needed at all.
Note: The node where the Hive metastore is located also needs to mount JuiceFS to the same mount point.
Data migration from HDFS to JuiceFS is straightforward, with a few steps:
- Deploy JuiceFS: mount JuiceFS at the same mount point on every node in the Hadoop cluster.
- Write new data to JuiceFS: modify the application code or update the data path in the configuration to
file:///jfs/xxx, so that new data is written to JuiceFS.
- If there is already data in HDFS, use the
distcp command (provided by Hadoop) to copy it into JuiceFS in parallel.
- Modify the code that reads the data from HDFS to read from JuiceFS instead.
- Delete the data in HDFS.
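A distcp run for the copy step might look like the following (the namenode address and paths are hypothetical):

```
$ hadoop distcp -m 16 hdfs://namenode:8020/user/data file:///jfs/user/data
```

The -m option sets the number of parallel copy tasks; tune it to the cluster's capacity.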
JuiceFS for HBase
HBase can also run on top of JuiceFS. This requires changing
hbase.rootdir in hbase-site.xml to a path in JuiceFS, for example:
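A minimal hbase-site.xml fragment, assuming JuiceFS is mounted at /jfs:

```
<property>
  <name>hbase.rootdir</name>
  <value>file:///jfs/hbase</value>
</property>
```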
Note: The owner of the directory should be
hadoop (the user HBase runs as).
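For example (assuming HBase runs as the user and group hadoop):

```
$ sudo chown -R hadoop:hadoop /jfs/hbase
```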
The existing tables in HDFS can be migrated to JuiceFS by these steps:
- Create snapshots for tables in HDFS.
hbase> snapshot 'MyTable', 'MySnapshot'
- Copy the snapshots from HDFS to JuiceFS.
$ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to file:///jfs/hbase -mappers 16
- Change the rootdir of HBase to JuiceFS, restart the cluster.
- Re-create the tables, restore them from snapshots.
hbase> restore_snapshot 'MySnapshot'