JuiceFS for Hadoop Workloads (Replacing HDFS)

Hadoop and Spark are popular big data analysis platforms, but their built-in storage system, HDFS, is a complex component that brings many challenges to system administration and comes with technical limitations of its own. JuiceFS addresses these issues well and, without requiring any maintenance, provides a better storage solution for Hadoop and Spark on the public cloud.

When a company running Hadoop migrates to a public cloud, the first challenge is how to migrate the data stored in HDFS. Public clouds often do not provide a fully managed HDFS service, so customers still have to operate HDFS themselves. In addition, although HDFS is a common choice for self-built data storage facilities, it does not integrate well with the existing storage products of the public cloud, which greatly reduces its effectiveness and fails to leverage the elasticity of the public cloud.

Object storage has problems of its own. It is not a file system and lacks features that Hadoop and Spark rely on heavily, such as strong consistency and atomic rename, so there is no guarantee that Hadoop and Spark jobs will run correctly, stably, and efficiently on it. As a file system built on top of object storage, JuiceFS keeps the scalability, maintenance-free operation, and low cost of object storage while also ensuring the correctness, stability, and efficiency of large-scale data analysis.

We will outline how JuiceFS can handle petabyte-scale production tasks faster, with less expense and lower administrative overhead.

Benefits of JuiceFS for Hadoop Users

  • Lower Storage Cost

    Unlike HDFS, JuiceFS is a fully managed storage solution: it does not consume CPU and memory to maintain storage services, and it does not require pre-provisioning more than three times the raw data size for replication. It relies on the object storage system for redundancy and durability guarantees, so for the same amount of data, JuiceFS only needs to store roughly a single copy, substantially reducing storage costs compared with HDFS.

  • No stop-the-world garbage collection (GC) on NameNodes

    HDFS is written in Java and suffers from stop-the-world garbage collection issues that cause the entire cluster to stop processing at unpredictable times. JuiceFS has no such limitations.

  • No continuous capacity management/tuning

    HDFS usually requires ongoing capacity planning and management, scaling vertically or horizontally as storage needs change. JuiceFS is fully elastic: you only pay for what you actually use, with no capacity planning or expansion operations required.

  • No HA management

    HDFS requires continuous monitoring and maintenance to keep the service highly available. JuiceFS is operated as a highly available service by a dedicated team that handles these problems for you, and its more efficient failover mechanism provides higher availability than HDFS.

  • No expensive Professional Services contracts that price-scale to the size of your cluster

    Because of HDFS's complexity, many companies buy expensive third-party professional services to keep it running stably. As a fully managed service, JuiceFS already includes professional support in its price; we take responsibility for running JuiceFS reliably and stably, so you can put your money and energy where they are needed more.

  • Replicate across all regions/zones or across heterogeneous cloud providers.

    HDFS does not support cross-region data replication, so customers have to design and implement complex replication schemes of their own, with limited results. JuiceFS lets you replicate data to any region of any cloud, so the same data can be accessed efficiently from multiple locations at the same time, and computing workloads can be migrated seamlessly between two public clouds or two regions.

  • Mirror all or part of your data to any place around the world in near real-time.

    JuiceFS provides near real-time data mirroring across public clouds and regions around the world, with only seconds of latency while ensuring data consistency.

  • Juicedata (the company) never sees your data.

    JuiceFS clients installed on your VMs communicate with the object store directly; your data never passes through our servers or any third-party proxy, guaranteeing the privacy of your data. Data replication is done entirely by the client on your VMs.

User Guide

The storage layer of Hadoop and Spark is pluggable, and they support many kinds of storage systems. JuiceFS can be accessed through the built-in LocalFS (file://) implementation, so you only need to mount JuiceFS on every compute node of the Hadoop cluster.

Prepare

Many Hadoop clusters do not have public IP addresses configured by default, but JuiceFS needs access to the public network to reach its metadata service. One solution is to configure an elastic IP for every Hadoop node. The other is to add a router so the Hadoop cluster can reach the public network:

Configure IP packet forwarding on a node with public IP:

# Enable IP packet forwarding
echo 1 > /proc/sys/net/ipv4/ip_forward
echo 1 > /proc/sys/net/ipv4/conf/all/forwarding
echo 1 > /proc/sys/net/ipv4/conf/default/forwarding
# NAT outgoing traffic and allow forwarded packets (eth0 is assumed to be the public-facing interface)
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
iptables -A FORWARD -i eth0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -i eth0 -o eth0 -j ACCEPT

(It is recommended to add net.ipv4.ip_forward = 1 to /etc/sysctl.conf and save the rules with iptables-save so they remain in effect after the node restarts.)
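
A minimal sketch of how these settings might be persisted, assuming common CentOS/Debian-style file locations (adjust paths and packages for your distribution):

echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf
sysctl -p
# Save the current iptables rules so they are restored at boot
# (requires the iptables-services or iptables-persistent package, depending on the distribution)
iptables-save > /etc/sysconfig/iptables    # RHEL/CentOS
iptables-save > /etc/iptables/rules.v4     # Debian/Ubuntu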

Use the node above as the router for the Hadoop cluster nodes to access the public network (assuming the intranet network segment is 10.0.0.0/8, 10.1.0.1 is the original default gateway, and 10.1.100.100 is the router's intranet IP). Run the following on each Hadoop node:

# Keep intranet traffic on the original gateway
ip route add 10.0.0.0/8 via 10.1.0.1
# Route all other traffic through the new router
route del default gw 10.1.0.1
route add default gw 10.1.100.100
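
To verify that a Hadoop node can now reach the public network through the router, a quick check such as the following can be run on the node (juicefs.com is used here only as an example endpoint):

# The default route should now point at 10.1.100.100
ip route show default
# Outbound HTTPS should succeed
curl -sI https://juicefs.com | head -n 1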

Mount JuiceFS

JuiceFS needs to be mounted at the same mount point on every node; /jfs is recommended. In the rest of this guide we assume JuiceFS is mounted at /jfs.

wget juicefs.com/static/juicefs
chmod +x juicefs
sudo ./juicefs auth NAME --token TOKEN --accesskey ACCESS_KEY --secretkey SECRET_KEY
sudo ./juicefs mount NAME /jfs

For automatic management, you can add the above commands to the cluster node initialization script.
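
For example, a node-initialization script might look roughly like the sketch below; it reuses the download URL and auth/mount commands shown above, and NAME, TOKEN, ACCESS_KEY, and SECRET_KEY are placeholders for your own values:

#!/bin/bash
# Sketch of a node bootstrap script: download the JuiceFS client and mount it at /jfs
set -e
wget -q juicefs.com/static/juicefs -O /usr/local/bin/juicefs
chmod +x /usr/local/bin/juicefs
/usr/local/bin/juicefs auth NAME --token TOKEN --accesskey ACCESS_KEY --secretkey SECRET_KEY
mkdir -p /jfs
/usr/local/bin/juicefs mount NAME /jfs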

JuiceFS for Spark

Once JuiceFS is mounted, reading and writing data in JuiceFS from Spark is as easy as pointing the path to the JuiceFS mount point, for example:

spark.range(1, 1000).write.parquet("file:///jfs/a/range")
spark.read.parquet("file:///jfs/a/range")

If you can change the Hadoop configuration file (core-site.xml), set LocalFS as the default file system:

<property>
  <name>fs.defaultFS</name>
  <value>file:///</value>
</property>

Or add the following to Spark's configuration file $SPARK_HOME/conf/spark-defaults.conf:

spark.hadoop.fs.defaultFS = file:///jfs/

Or add the following line to the front of the Spark script:

Spark > spark.sparkContext.hadoopConfiguration.set("fs.defaultFS", "file:///")

Then the above could be simplified as:

Spark > spark.range(1, 1000).write.parquet("/jfs/a/range")
Spark > spark.read.parquet("/jfs/a/range")

The way other components of Hadoop access JuiceFS is very similar.
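
For example, the standard hadoop fs commands can read and write the mounted path through the file:// scheme (the paths below are only examples):

hadoop fs -mkdir -p file:///jfs/tmp
hadoop fs -put local.txt file:///jfs/tmp/
hadoop fs -ls file:///jfs/tmp/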

JuiceFS for Hive

Hive also supports JuiceFS. If JuiceFS is not the default Hive storage, you need to create the table as an external table and set its storage location to a path in JuiceFS, for example:

$ mkdir /jfs/hive/test_json
Hive > CREATE EXTERNAL TABLE test_json (id BIGINT, text STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' STORED AS TEXTFILE LOCATION 'file:///jfs/hive/test_json';
Hive > LOAD DATA INPATH "file:///jfs/hive/test.json" OVERWRITE INTO TABLE test_json;
Hive > SELECT * FROM test_json;

Hive extensions can also be placed in JuiceFS, such as:

Hive > ADD JAR file:///jfs/hive/lib/hive-json-serde.jar;

If JuiceFS is set as the default storage for Hive, no changes are needed at all.

Note: The node where the Hive metastore is located also needs to mount JuiceFS to the same mount point.
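
A quick way to check this on the metastore host (assuming the same /jfs mount point used above):

# Run on the Hive metastore host; both commands should show the JuiceFS mount
mount | grep /jfs
ls /jfs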

Data Migration

Data migration from HDFS to JuiceFS is straightforward, with only a few steps:

  • Deploy JuiceFS: mount JuiceFS at the same mount point on every node of the Hadoop cluster.
  • Write new data to JuiceFS: modify the application code or update the data path in the configuration to file:///jfs/xxx, so that new data is written to JuiceFS.
  • If part of the data is already in HDFS, use the distcp command (provided by Hadoop) to copy it into JuiceFS in parallel, as shown in the sketch after this list.
  • Modify the code that reads the data to read from file:///jfs/xxx.
  • Delete the data in HDFS.
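
As a sketch of the distcp step above, a command along these lines copies an existing HDFS directory into JuiceFS; the NameNode address, paths, and mapper count are placeholders to adjust for your cluster:

hadoop distcp -m 20 hdfs://namenode:8020/user/hive/warehouse file:///jfs/hive/warehouse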

JuiceFS for HBase

HBase can also run on top of JuiceFS; this requires changing hbase.rootdir in hbase-site.xml to a path in JuiceFS, for example:

<property>
  <name>hbase.rootdir</name>
  <value>file:///jfs/hbase</value>
</property>

Note: The owner of the directory should be hadoop (the user HBase runs as).
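
For example, assuming HBase runs as the hadoop user and group (adjust the names to your deployment):

sudo mkdir -p /jfs/hbase
sudo chown -R hadoop:hadoop /jfs/hbase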

Existing tables in HDFS can be migrated to JuiceFS with these steps:

  • Create snapshots for tables in HDFS.
hbase> snapshot 'MyTable', 'MySnapshot'
  • Copy the snapshots from HDFS to JuiceFS.
$ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to file:///jfs/hbase -mappers 16
  • Change the rootdir of HBase to JuiceFS, restart the cluster.
  • Re-create the tables, restore them from snapshots.
hbase> restore_snapshot 'MySnapshot'