Using JuiceFS in Hadoop
For users migrating big data infrastructure from IDC to public cloud services, JuiceFS is a great alternative or supplement to HDFS, providing better efficiency at lower cost. Read "From Hadoop to cloud native — Separation of compute and storage for big data platform" for more information.
If you decide to adopt JuiceFS in your Hadoop infrastructure, the JuiceFS Hadoop SDK is the recommended way. In theory you can use RawLocalFileSystem to access a JuiceFS mount point, but due to the HDFS client implementation, this practice comes with performance issues and is not recommended.
Before getting started with the Hadoop SDK, read this chapter for an introduction.
Quick start
The following video tutorial demonstrates how to use the JuiceFS Hadoop SDK in a Hadoop environment, using Amazon EMR as an example:
User / Group administration
Hadoop user group information is configured through hadoop.security.group.mapping, which defaults to using the system's user groups (ShellBasedUnixGroupsMapping). HDFS uses the user group information on the NameNode, which can be refreshed with hdfs dfsadmin -refreshUserToGroupsMappings.
JuiceFS directly uses the user / group information on the client node and doesn't need refreshing. However, if user / group information is modified on one node, you must sync the changes to every node using JuiceFS and restart all relevant services. To avoid this toil, we recommend the following practice for dynamic user / group configuration.
User / group information is configurable for the JuiceFS Hadoop SDK. It's best to put the configuration file directly inside JuiceFS so that all clients share the same config. When changes are made, all JuiceFS clients automatically pick them up within 2 minutes, and no service restart is required. Follow these steps to set it up:
- Modify core-site.xml on all nodes and add the following configuration:

    <property>
      <name>juicefs.grouping</name>
      <value>jfs://{VOL_NAME}/etc/group</value>
    </property>

- Create the actual config file that juicefs.grouping points to on JuiceFS.

- Add the desired user / group entries to this config file, using the format:

    groupname:username1,username2
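For example, the file at jfs://{VOL_NAME}/etc/group might contain entries like the following (the group and user names here are purely illustrative):

    supergroup:hdfs,hive
    analysts:alice,bob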
Memory usage
Depending on the read and write load of computing tasks (such as Spark executors), the JuiceFS Hadoop Java SDK may require an additional 4 * juicefs.memory-size of off-heap memory to speed up reads and writes. Since juicefs.memory-size defaults to 300MB, it is recommended to configure at least 1.2GB of off-heap memory for compute tasks by default.
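For Spark, this off-heap memory is typically accounted for in the executor memory overhead. Below is a minimal sketch for spark-defaults.conf, assuming the default juicefs.memory-size; the exact value should be tuned for your workload:

    # reserve at least 4 * juicefs.memory-size (4 * 300MB = 1.2GB) of off-heap memory per executor
    spark.executor.memoryOverhead  1200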
hflush implementation
When data is committed into HDFS, hflush is called. This process differs between HDFS and JuiceFS:
HDFS submits data writes to DataNode memory, resulting in lower latency. Upon completion, the data change is visible to all other clients.
JuiceFS supports different modes for hflush, configured via juicefs.hflush:
- writeback (default)

  Under writeback mode, calling hflush only writes data to local disks, and the data is asynchronously uploaded to object storage. If the client exits abnormally before the upload finishes, data will be lost.

  When writeback is enabled, you can also configure juicefs.hflush-delay to control the hflush interval. It defaults to 0, meaning data is written to local disks immediately.

- sync

  Calling hflush synchronously uploads data to object storage. This inherently has higher latency, but comes with stronger data consistency.
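For example, to opt into the stronger consistency mode, you could set the following in core-site.xml (a sketch; if unset, writeback remains the default):

    <property>
      <name>juicefs.hflush</name>
      <value>sync</value>
    </property>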
Cache
JuiceFS has excellent cache capabilities, which are thoroughly introduced in "Cache", so this chapter only covers Hadoop-related practices. Just like the JuiceFS Linux client, the cache is configurable for the Hadoop SDK, see configurations.
Cache directory
By default, the Hadoop SDK uses memory as the cache directory. Since memory capacity is often limited and excessive memory usage can easily cause OOM problems, it is recommended to change the cache directory (the juicefs.cache-dir configuration) to a local disk path (preferably SSD) in production, and adjust the cache size accordingly (the juicefs.cache-size configuration).
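A sketch of such a setup in core-site.xml; the path and size (in MiB) are illustrative and should match the actual disks on your nodes:

    <property>
      <name>juicefs.cache-dir</name>
      <value>/data/jfscache</value>
    </property>
    <property>
      <name>juicefs.cache-size</name>
      <value>102400</value>
    </property>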
Local cache
In order to better utilize host-level cache, JuiceFS marks every logical data block (128MB, same as the HDFS block size, i.e. the dfs.blocksize configuration) with BlockLocation info. This tells computation engines like Spark / MapReduce / Presto to schedule tasks onto the same hosts, which maximizes cache usage. BlockLocation is calculated by building a consistent hashing ring over the computing nodes, so that adding or removing nodes from the cluster has only a small impact on the cache.
In order to calculate BlockLocation, you need to configure juicefs.discover-nodes-url so that JuiceFS can obtain the node list from the corresponding API provided by engines like YARN, Spark, and Presto.
Also, if distributed cache is enabled, this node list is used as a whitelist for cache group members, preventing nodes without cache capabilities from joining the cache group.
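For example, on a YARN cluster this is usually pointed at YARN's node discovery. A sketch in core-site.xml; the value shown assumes the SDK's YARN discovery shortcut is available, so check the configuration reference for the exact URL format your engine requires:

    <property>
      <name>juicefs.discover-nodes-url</name>
      <value>yarn</value>
    </property>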
Distributed cache and dedicated cache cluster
Read "Distributed Cache" first for more information.
In the Hadoop SDK, use juicefs.cache-group to enable distributed cache; nodes within the same cache group share cache data with each other. This works well when the JuiceFS clients run on a "fixed" set of nodes, like Presto or Spark Thrift Server.
But for computation clusters that are constantly scaling up and down, like Spark on Kubernetes, it's best to use a dedicated cache cluster. To set this up, make sure juicefs.cache-group is the same between the two clusters, and set juicefs.no-sharing to true for the computation cluster. This prevents the computation cluster from contributing cache to the group (it only asks for cache, never gives).
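A sketch of the computation-cluster side of this setup in core-site.xml (the group name is illustrative); the dedicated cache cluster uses the same juicefs.cache-group but leaves juicefs.no-sharing unset:

    <property>
      <name>juicefs.cache-group</name>
      <value>jfs-cache-group</value>
    </property>
    <property>
      <name>juicefs.no-sharing</name>
      <value>true</value>
    </property>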
Getting Started
Refer to installation and the following chapters to start using JuiceFS on Hadoop.