Using JuiceFS in Hadoop
For users migrating big data infrastructure from IDC to public cloud services, JuiceFS is a great alternative or supplement to HDFS, providing better efficiency at lower cost. Read "From Hadoop to cloud native — Separation of compute and storage for big data platform" for more information.
If you decide to adopt JuiceFS in your Hadoop infrastructure, the JuiceFS Hadoop SDK is the recommended way. In theory you can use RawLocalFileSystem to access a JuiceFS mount point, but due to the HDFS client implementation, this practice comes with performance issues and is not recommended.
Before getting started with the Hadoop SDK, read this chapter for an introduction.
Quick start
The following video tutorial demonstrates how to use the JuiceFS Hadoop SDK in a Hadoop environment, using Amazon EMR as an example:
User / Group administration
Hadoop user group information is configured through hadoop.security.group.mapping, which defaults to using the system's user groups (ShellBasedUnixGroupsMapping). HDFS uses the user group information on the NameNode, which can be refreshed with hdfs dfsadmin -refreshUserToGroupsMappings.
JuiceFS directly uses the user / group information on the client node and doesn't need refreshing. However, if user / group information is modified on one node, you must sync the changes to every node using JuiceFS and restart all relevant services. To avoid this toil, we recommend the following practice for dynamic user / group configuration.
User / group information is configurable for the JuiceFS Hadoop SDK. It's best to put the configuration file directly inside JuiceFS so that all clients share the same config. When changes are made, all JuiceFS clients automatically pick them up within 2 minutes, and no service restart is required. Follow these steps to set it up:
- Modify core-site.xml on all nodes and add the following configuration:

    <property>
      <name>juicefs.grouping</name>
      <value>jfs://{VOL_NAME}/etc/group</value>
    </property>

- Create the actual config file that juicefs.grouping points to on JuiceFS.

- Add the desired user / group entries to this config file, using the format:

    groupname:username1,username2
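For example, the file at jfs://{VOL_NAME}/etc/group might contain entries like the following (the group and user names here are purely illustrative):

    supergroup:hdfs,hive
    analysts:alice,bob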
Memory usage
Depending on the read and write load of computing tasks (such as Spark executors), the JuiceFS Hadoop Java SDK may require an additional 4 * juicefs.memory-size of off-heap memory to speed up reads and writes. Since juicefs.memory-size defaults to 300MB, it is recommended to configure at least 1.2GB of off-heap memory for compute tasks by default.
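For Spark, this off-heap memory is typically accounted for in the executor memory overhead. Below is a minimal sketch for spark-defaults.conf, assuming the default juicefs.memory-size; the exact value should be tuned for your workload:

    # reserve at least 4 * juicefs.memory-size (4 * 300MB = 1.2GB) of off-heap memory per executor
    spark.executor.memoryOverhead  1200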
hflush implementation
When data is committed into HDFS, hflush is called. This process differs between HDFS and JuiceFS:
HDFS submits data writes to DataNode memory, resulting in lower latency. Upon completion, the data change is visible to all other clients.
JuiceFS supports different modes for hflush, configured via juicefs.hflush:
- writeback (default)

  Under writeback mode, calling hflush only writes data to local disks, and the data is asynchronously uploaded to object storage. If the client exits abnormally before the upload finishes, data will be lost.

  When writeback is enabled, you can also configure juicefs.hflush-delay to control the hflush interval. It defaults to 0, meaning data is written to local disks immediately.

- sync

  Calling hflush synchronously uploads data to object storage. This inherently has higher latency, but comes with stronger data consistency.
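For example, to opt into the stronger consistency mode, you could set the following in core-site.xml (a sketch; if unset, writeback remains the default):

    <property>
      <name>juicefs.hflush</name>
      <value>sync</value>
    </property>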
Cache
JuiceFS has excellent cache capabilities, which are thoroughly introduced in "Cache", so this chapter only covers Hadoop-related practices. Just like the JuiceFS Linux client, the cache is configurable for the Hadoop SDK, see configurations.
Cache directory
By default, the Hadoop SDK uses memory as the cache directory. Since memory capacity is often limited and excessive memory usage can easily cause OOM problems, it is recommended to change the cache directory (the juicefs.cache-dir configuration) to a local disk path (preferably SSD) in production, and adjust the cache size accordingly (the juicefs.cache-size configuration).
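A sketch of such a setup in core-site.xml; the path and size (in MiB) are illustrative and should match the actual disks on your nodes:

    <property>
      <name>juicefs.cache-dir</name>
      <value>/data/jfscache</value>
    </property>
    <property>
      <name>juicefs.cache-size</name>
      <value>102400</value>
    </property>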
Local cache
In order to better utilize host-level cache, JuiceFS marks every logical data block (128MB, same as the HDFS block size, i.e. the dfs.blocksize configuration) with BlockLocation info. This tells computation engines like Spark / MapReduce / Presto to schedule tasks onto the same hosts, which maximizes cache usage. BlockLocation is calculated by building a consistent hashing ring over the computing nodes, so that adding or removing nodes from the cluster has only a small impact on the cache.
In order to calculate BlockLocation, you need to configure juicefs.discover-nodes-url so that JuiceFS can obtain the node list from the corresponding API provided by engines like YARN, Spark, and Presto.
Also, if distributed cache is enabled, this node list is used as a whitelist for cache group members, preventing nodes without cache capabilities from joining the cache group.
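For example, on a YARN cluster this is usually pointed at YARN's node discovery. A sketch in core-site.xml; the value shown assumes the SDK's YARN discovery shortcut is available, so check the configuration reference for the exact URL format your engine requires:

    <property>
      <name>juicefs.discover-nodes-url</name>
      <value>yarn</value>
    </property>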
Distributed cache and dedicated cache cluster
Read "Distributed Cache" first for more information.
In the Hadoop SDK, use juicefs.cache-group to enable distributed cache; nodes within the same cache group share cache data with each other. This works well when the JuiceFS clients run on a "fixed" set of nodes, like Presto or Spark Thrift Server.
But for computation clusters that are constantly scaling up and down, like Spark on Kubernetes, it's best to use a dedicated cache cluster. To set this up, make sure juicefs.cache-group is the same between the two clusters, and set juicefs.no-sharing to true for the computation cluster. This prevents the computation cluster from contributing cache to the group (it only asks for cache, never gives).
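A sketch of the computation-cluster side of this setup in core-site.xml (the group name is illustrative); the dedicated cache cluster uses the same juicefs.cache-group but leaves juicefs.no-sharing unset:

    <property>
      <name>juicefs.cache-group</name>
      <value>jfs-cache-group</value>
    </property>
    <property>
      <name>juicefs.no-sharing</name>
      <value>true</value>
    </property>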
Getting Started
Refer to installation and the following chapters to start using JuiceFS on Hadoop.