
Using JuiceFS in Hadoop via POSIX mounting

Hadoop's (and Spark's) FileSystem API is pluggable and supports many different storage systems. JuiceFS is supported out of the box through LocalFS: simply mount it at the same path on every node in the Hadoop cluster.

Preparations

Many Hadoop clusters have no public IP configured by default (for example, UHadoop on UCloud), while JuiceFS needs public network access to reach its metadata service. One solution is to assign an elastic IP to every Hadoop node; another is to add a router with external network access to the Hadoop cluster, configured as follows:

Configure IP packet forwarding on a node with extranet IP:

# Enable IPv4 packet forwarding
echo 1 > /proc/sys/net/ipv4/ip_forward
echo 1 > /proc/sys/net/ipv4/conf/all/forwarding
echo 1 > /proc/sys/net/ipv4/conf/default/forwarding
# Masquerade outgoing traffic and allow forwarded traffic through eth0
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
iptables -A FORWARD -i eth0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -i eth0 -o eth0 -j ACCEPT

(It is recommended to add net.ipv4.ip_forward=1 to /etc/sysctl.conf and to save the rules with iptables-save so that they survive a reboot.)
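For example, a minimal sketch of persisting these settings (how the saved rules are reloaded at boot depends on your distribution, e.g. iptables-persistent on Debian/Ubuntu):

# Persist IP forwarding across reboots
echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf
sysctl -p

# Save the current iptables rules; reload them at boot with your distribution's mechanism
iptables-save > /etc/iptables.rules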

Use the above node as a router for external network access on the Hadoop cluster (assuming the internal network segment is 10.0.0.0/8, 10.1.0.1 is the default router, and 10.1.100.100 is the internal IP of the new router):

# Keep internal traffic on the original router
ip route add 10.0.0.0/8 via 10.1.0.1
# Switch the default route to the new router with external access
route del default gw 10.1.0.1
route add default gw 10.1.100.100
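To confirm that a cluster node now reaches the public network through the new router, a quick check might look like this (the target addresses are just examples):

# Show which route is chosen for an external address
ip route get 1.1.1.1
# Should succeed through the new default gateway
curl -I https://juicefs.com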

Mount JuiceFS

JuiceFS needs to be mounted at the same mount point on every node; /jfs is recommended, and the rest of this tutorial assumes JuiceFS is mounted at /jfs.

# Download the JuiceFS client
wget juicefs.com/static/juicefs
chmod +x juicefs
# Authenticate against the volume, then mount it at /jfs
sudo ./juicefs auth NAME --token TOKEN --accesskey ACCESS_KEY --secretkey SECRET_KEY
sudo ./juicefs mount NAME /jfs
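Once mounted, a quick sanity check (output varies by environment):

# The mount point should be listed as a JuiceFS file system
df -h /jfs
ls /jfs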

To simplify administration, you can add the above commands to the cluster nodes' initialization script.
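For example, a minimal init-script fragment might look like this (a sketch; NAME, TOKEN and the keys are the placeholders from the commands above):

#!/bin/bash
# Install the JuiceFS client and mount the volume at /jfs on every node
set -e
wget -q juicefs.com/static/juicefs -O /usr/local/bin/juicefs
chmod +x /usr/local/bin/juicefs
/usr/local/bin/juicefs auth NAME --token TOKEN --accesskey ACCESS_KEY --secretkey SECRET_KEY
/usr/local/bin/juicefs mount NAME /jfs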

Accessing JuiceFS in Spark

Once JuiceFS is mounted, reading and writing data in JuiceFS from Spark is straightforward: just point the path at the JuiceFS mount point, for example:

spark.range(1, 1000).write.parquet("file:///jfs/a/range")
spark.read.parquet("file:///jfs/a/range")

To shorten these paths, you can change the Hadoop configuration file core-site.xml to make LocalFS the default file system:

<property>
  <name>fs.defaultFS</name>
  <value>file:///</value>
</property>

or add the following to Spark's configuration file $SPARK_HOME/conf/spark-defaults.conf:

spark.hadoop.fs.defaultFS = file:///

Or add this to the front of a Spark script:

Spark > spark.sparkContext.hadoopConfiguration.set("fs.defaultFS", "file:///")

With the default file system set, the examples above can be abbreviated to:

Spark > spark.range(1, 1000).write.parquet("/jfs/a/range")
Spark > spark.read.parquet("/jfs/a/range")

Other components in Hadoop access JuiceFS in a very similar way.
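For example, the standard hadoop fs commands work against JuiceFS through file:// paths (a quick sketch; the file names are just examples):

hadoop fs -ls file:///jfs/
hadoop fs -put local.txt file:///jfs/tmp/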

Using JuiceFS in Hive

Hive also supports JuiceFS. When JuiceFS is not Hive's default storage, create the table as an external table and set its location to a path in JuiceFS, for example:

$ mkdir /jfs/hive/test_json
Hive > CREATE EXTERNAL TABLE test_json (id BIGINT, text STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' STORED AS TEXTFILE LOCATION 'file:///jfs/hive/test_json';
Hive > LOAD DATA INPATH "file:///jfs/hive/test.json" OVERWRITE INTO TABLE test_json;
Hive > SELECT * FROM test_json;
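For reference, /jfs/hive/test.json would contain one JSON object per line matching the table schema; sample rows, assumed purely for illustration:

{"id": 1, "text": "hello"}
{"id": 2, "text": "world"}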

Hive extensions can also be placed in JuiceFS, for example:

Hive > ADD JAR file:///jfs/hive/lib/hive-json-serde.jar;

If JuiceFS is set as the default storage for Hive, nothing needs to change to use it.
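One way to make JuiceFS the default storage for Hive is to point the warehouse directory at the mount point in hive-site.xml (the exact path below is an assumption; adjust it to your layout):

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>file:///jfs/hive/warehouse</value>
</property>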

Note: The node where the Hive metastore is located also needs to mount JuiceFS to the same mount point.

Data Migration

Migrating data from HDFS to JuiceFS is straightforward; it takes only a few steps:

  • Deploy JuiceFS: mount JuiceFS at the same mount point on each node in the Hadoop cluster.
  • Write new data to JuiceFS: modify the application code or update the data path in the configuration to file:///jfs/xxx, so that new data is written to JuiceFS.
  • If there is already data in HDFS, use the distcp command (provided by Hadoop) to copy it into JuiceFS in parallel (see the sketch after this list).
  • Once the existing data has been copied, modify the code that reads it to read from file:///jfs/xxx.
  • Delete the data in HDFS.
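A minimal distcp invocation might look like this (the HDFS path and mapper count are examples; adjust them to your cluster):

# Copy a directory tree from HDFS into JuiceFS with 20 parallel mappers
hadoop distcp -m 20 hdfs:///user/hive/warehouse file:///jfs/user/hive/warehouse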

JuiceFS for HBase

HBase can also run on top of JuiceFS; this requires changing hbase.rootdir in hbase-site.xml to a path in JuiceFS, for example:

<property>
  <name>hbase.rootdir</name>
  <value>file:///jfs/hbase</value>
</property>

Note: The owner of the directory should be hadoop (the user HBase runs as).
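Assuming HBase runs as the hadoop user, setting the ownership might look like:

# The group may differ on your cluster
sudo chown -R hadoop:hadoop /jfs/hbase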

Existing tables in HDFS can be migrated to JuiceFS with the following steps:

  • Create snapshots for the tables in HDFS:
hbase> snapshot 'MyTable', 'MySnapshot'
  • Copy the snapshots from HDFS to JuiceFS:
$ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to file:///jfs/hbase -mappers 16
  • Change the rootdir of HBase to JuiceFS and restart the cluster.
  • Re-create the tables and restore them from the snapshots:
hbase> restore_snapshot 'MySnapshot'
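To verify a restored table, a quick check in the HBase shell might be (using the table from the example above):

hbase> list
hbase> scan 'MyTable', {LIMIT => 1}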