For users migrating big data infrastructure from IDC to public cloud services, JuiceFS is a great alternative or supplement to HDFS, providing better efficiency while lowering cost. Read "From Hadoop to cloud native — Separation of compute and storage for big data platform" for more information.
If you decide to adopt JuiceFS in your Hadoop infrastructure, the JuiceFS Hadoop SDK is the recommended way. In theory you could use `RawLocalFileSystem` to access a JuiceFS mount point, but due to the HDFS client implementation, this practice comes with performance issues and is not recommended.
Before getting started with Hadoop SDK, read this chapter for an introduction.
The following is a video tutorial on how to use the JuiceFS Hadoop SDK in a Hadoop environment using Amazon EMR as an example:
User / Group administration
Hadoop user group information is configured through `hadoop.security.group.mapping`, which defaults to the system user group (`ShellBasedUnixGroupsMapping`). HDFS respects the user group information on the NameNode, which can be refreshed with `hdfs dfsadmin -refreshUserToGroupsMappings`.
JuiceFS directly uses the user / group information on the client node and does not need refreshing. However, if user / group information is modified on one node, you must sync the changes to all nodes using JuiceFS and restart all relevant services. To avoid this toil, we recommend the practice below to achieve dynamic user / group configuration.
User / group information is configurable for the JuiceFS Hadoop SDK. It is best to put the configuration file directly inside JuiceFS so that all clients share the same config. When changes are introduced, all JuiceFS clients automatically pick them up within 2 minutes, with no service restart required. Follow these steps to set it up:
Edit `core-site.xml` on all nodes and add the following configuration:
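A minimal sketch of the relevant property; the volume name `myjfs` and the group file path are placeholders to adjust for your deployment:

```xml
<property>
  <name>juicefs.grouping</name>
  <value>jfs://myjfs/etc/group</value>
</property>
```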
Create the actual config file that `juicefs.grouping` points to on JuiceFS.
Add the desired modifications to this config file, using the following format:
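A sketch of the common `/etc/group`-style layout: one group per line, the group name followed by a colon and a comma-separated list of member usernames. The group and user names below are made-up placeholders:

```
supergroup:ada,lin
hive:hive
```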
Depending on the read and write load of computing tasks (such as Spark executors), the JuiceFS Hadoop Java SDK may require an additional 4 × `juicefs.memory-size` of off-heap memory to speed up reads and writes. By default, it is recommended to configure at least 1.2GB of off-heap memory for compute tasks.
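For a Spark executor, this headroom can be reserved via Spark's standard memory-overhead setting. The property name is standard Spark; the value is an illustrative assumption matching the 1.2GB recommendation above:

```
spark.executor.memoryOverhead=1200m
```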
When data is committed into HDFS, `hflush` is called; this process differs between HDFS and JuiceFS:
HDFS submits data writes to DataNode memory, resulting in lower latency. Upon completion, the data change is visible to all other clients.
For JuiceFS, different modes are supported for `hflush`:

- Asynchronous mode (`writeback` enabled): `hflush` only writes data to local disk, and the data is asynchronously uploaded to object storage. If the client exits abnormally and never finishes uploading, data will be lost. When `writeback` is enabled, you can also configure an `hflush` interval, which defaults to 0, meaning data is written to local disk immediately.
- Synchronous mode: `hflush` synchronously uploads data to object storage; this inherently has higher latency, but comes with stronger data consistency.
JuiceFS has excellent caching capabilities, thoroughly introduced in Cache, so this chapter only covers Hadoop-related practices. Just like with the JuiceFS Linux client, cache is configurable for the Hadoop SDK; see configurations.
By default, the Hadoop SDK uses memory as the cache directory. Since memory capacity is often limited, heavy memory usage can easily cause OOM problems. Therefore, in production environments it is recommended to change the cache directory (the `juicefs.cache-dir` configuration) to a local disk path (such as an SSD) and adjust the cache size appropriately.
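A sketch of the corresponding `core-site.xml` entry; the SSD mount path shown is a placeholder:

```xml
<property>
  <name>juicefs.cache-dir</name>
  <value>/mnt/ssd1/jfscache</value>
</property>
```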
In order to better utilize host-level cache, JuiceFS marks every logical data block (128MB, same as the HDFS block size, i.e. the `dfs.blocksize` configuration) with `BlockLocation` info. This tells computation engines like Spark / MapReduce / Presto to schedule jobs onto the same host, which maximizes cache usage.
`BlockLocation` is calculated by building a consistent hashing ring over the computing nodes, so that adding or removing nodes from the cluster only has a small impact on the cache.
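The consistent hashing idea can be sketched as follows. This is an illustration only, not the SDK's actual implementation: each block index hashes onto a ring of virtual nodes, and removing a host only remaps the blocks that fell on that host's arcs of the ring:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative consistent hashing sketch: map each logical block to a node
// so that removing a node only remaps the blocks it previously owned.
public class BlockLocationSketch {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private static final int VNODES = 100; // virtual nodes per host, smooths the distribution

    public BlockLocationSketch(List<String> nodes) {
        for (String node : nodes) {
            for (int i = 0; i < VNODES; i++) {
                // Place each host at many pseudo-random positions on the ring
                ring.put((node + "#" + i).hashCode(), node);
            }
        }
    }

    // Pick the first node clockwise from the block's hash position.
    public String locate(String fileId, long blockIndex) {
        int h = (fileId + "@" + blockIndex).hashCode();
        Map.Entry<Integer, String> e = ring.ceilingEntry(h);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node-a", "node-b", "node-c", "node-d");
        BlockLocationSketch before = new BlockLocationSketch(nodes);
        BlockLocationSketch after = new BlockLocationSketch(nodes.subList(0, 3)); // drop node-d

        int moved = 0, total = 1000;
        for (int b = 0; b < total; b++) {
            if (!before.locate("file-1", b).equals(after.locate("file-1", b))) {
                moved++;
            }
        }
        // Only blocks that previously mapped to the removed node change location,
        // so roughly a quarter of the blocks move rather than all of them.
        System.out.println("remapped " + moved + " of " + total + " blocks");
    }
}
```

With plain modulo hashing (`hash % nodeCount`), removing one node would reshuffle nearly every block; the ring keeps most block-to-host assignments, and thus most warm cache, stable.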
So in order to calculate `BlockLocation`, you need to configure `juicefs.discover-nodes-url` so that JuiceFS can obtain the node list from the corresponding API provided by engines like YARN, Spark, or Presto.
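As a sketch, for a YARN-based cluster the configuration might look like the following; the value shown is an assumption, so check the SDK configuration reference for the exact endpoint your engine exposes:

```xml
<property>
  <name>juicefs.discover-nodes-url</name>
  <value>yarn</value>
</property>
```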
Also, if distributed cache is enabled, this node list is used as a whitelist for cache group members, preventing nodes without cache capabilities from joining the cache group.
Distributed cache and dedicated cache cluster
Read "Distributed Cache" first for more information.
In the Hadoop SDK, use `juicefs.cache-group` to enable distributed cache; nodes within the same cache group share cache data with each other. This works well when JuiceFS is mounted in a "fixed" location, such as Presto or Spark Thrift Server.
But for computation clusters that constantly scale up and down, like Spark on Kubernetes, it is best to use a dedicated cache cluster. To set this up, make sure `juicefs.cache-group` is the same between the two clusters, and set the no-sharing option to `true` for the computation cluster; this forbids the computation cluster from contributing to the cache group (it only asks for cache, never serves it).
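For example, both the cache cluster and the computation cluster would carry the same cache-group name; the group name below is a placeholder:

```xml
<property>
  <name>juicefs.cache-group</name>
  <value>prod-cache</value>
</property>
```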
Refer to installation and the following chapters to start using JuiceFS with Hadoop.