
Accessing JuiceFS using a Java client in Hadoop


JuiceFS provides an HDFS-compatible Java client to support a variety of applications in the Hadoop ecosystem.

Notice: The JuiceFS Java client will not create the underlying object storage bucket automatically. Please create the bucket first.
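
For example, if the volume's data is stored in AWS S3, the bucket could be created ahead of time with the AWS CLI (the bucket name and region below are placeholders):

aws s3 mb s3://your-juicefs-bucket --region us-east-1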

Caution

JuiceFS and HDFS differ slightly in user group management. Hadoop user group information is configurable (hadoop.security.group.mapping) and uses system user group information (ShellBasedUnixGroupsMapping) by default, where:

HDFS uses the user group information of the node where the NameNode resides, and refreshes its cache with an administrative command (hdfs dfsadmin -refreshUserToGroupsMappings).

JuiceFS uses the user group information of the node where each client was created, and it cannot be refreshed. When you modify user group information, you need to apply the change on every node where JuiceFS is deployed and restart the services that use JuiceFS.

For example, to add the yarn user to a group named supergroup, run the following commands on every machine:

groupadd supergroup
usermod -a -G supergroup yarn

Preparations

The JuiceFS client needs to access the JuiceFS metadata service over the external network, so NAT needs to be deployed to provide external network access for the Hadoop cluster.

Note: The JuiceFS metadata service is deployed in the same public cloud and the same region (possibly a different availability zone) as the customer, and the client accesses the service in that region over the public network. If the metadata service is deployed on the customer's intranet, this requirement does not apply.

Automatically Install JuiceFS Java Client

Mount JuiceFS Filesystem

See: Mount JuiceFS Filesystem

Python Script Installation

INSTALL SCRIPT: setup-hadoop.py. Run the script as follows on the ClouderaManager node, the Ambari node, or the EMR master node on the public cloud.

This script is written in Python and supports Python2 and Python3.

When you run the script, the JuiceFS Java client (juicefs-hadoop.jar) is downloaded and installed across the entire cluster, and the related configuration is deployed to core-site.xml automatically (automatic configuration supports the HDP and CDH distributions only).

Usage

python setup-hadoop.py COMMAND

COMMAND can be one of:
-h             show help information
install        install the JuiceFS JAR file on the current node
install_all    install the JuiceFS JAR file on all nodes; requires SSH access as the root user
config         automatically fill in the configuration; supports Ambari and ClouderaManager only
deploy_config  automatically deploy the configuration to all nodes; supports Ambari and ClouderaManager only
test           check whether the installation is correct on the current node
test_all       check whether the installation is correct on all nodes

Installation Guide

  1. Install the JAR File
python setup-hadoop.py install

juicefs-hadoop.jar will be downloaded to /usr/local/lib, and a symlink will be created in the lib directory of each component of the Hadoop distribution.

The detailed locations are printed in the log.
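
To double-check where the script put things, you can list the JAR and one of the symlinks (the paths below are assumptions; use the locations printed in the log):

ls -l /usr/local/lib/juicefs-hadoop.jar
ls -l $HADOOP_HOME/share/hadoop/common/lib/ | grep juicefs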

  2. Configuration
python setup-hadoop.py config

This operation writes the required configuration items to the core-site.xml file of HDFS and prints them in the log.

For CDH and HDP environments, run this command and enter the administrator's password when prompted. If juicefs auth or juicefs mount has been executed successfully on this node, the authentication information in /root/.juicefs/ is read automatically and the configuration is written to the core-site.xml file via the RESTful API.

This operation also creates the cache directories on the node according to the configured juicefs.cache-dir.

  3. Distribute the JAR file to the cluster
  • For CDH or HDP distribution, run:

    python setup-hadoop.py install_all
  • For EMR of the public cloud, run:

    export NODE_LIST=node1,node2
    python setup-hadoop.py install_all
  • For Apache community edition, run:

    # the classpath of each component, separated by commas
    export EXTRA_PATH=$HADOOP_HOME/share/hadoop/common/lib,$SPARK_HOME/jars,$PRESTO_HOME/plugin/hive-hadoop2
    export NODE_LIST=node1,node2
    python setup-hadoop.py install_all

This operation copies juicefs-hadoop.jar and /etc/juicefs/.juicefs to the specified nodes via scp.

Installation by Ansible

Ansible configuration template: setup-hadoop.yml

Ansible hosts configuration

[master] # node with external network access that can connect to the slave nodes via SSH
master001

[slave] # all nodes in the Hadoop cluster, excluding the master nodes
slave001
slave002

Usage

ansible-playbook setup-hadoop.yml \
--extra-vars '{"jfs_name":"your-jfs-name", "jfs_version":"0.5", "hadoop":"your-hadoop-dist", "cache_dir":["/data01/jfscache","/data02/jfscache"]}'

Parameters:
jfs_name: JuiceFS volume name
jfs_version: JuiceFS Java client version
hadoop: Hadoop distribution name; supported values: cdh5, cdh6, hdp, emr-ali, emr-tencent, kmr, uhadoop
cache_dir: Cache directory on the current node

Manually Install JuiceFS Java Client

  1. Download the latest juicefs-hadoop-4.5.4.jar
  2. Add the JAR file to the classpath of each Hadoop component; you can look up the classpath with the hadoop classpath command (see the sketch below).

P.S. Some components have directories where JAR files are placed by default, which are automatically added to the classpath.
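
A minimal sketch of the manual installation, assuming the JAR was downloaded to /usr/local/lib and that the Hadoop common lib directory is on the classpath (verify against the hadoop classpath output on your cluster):

# show the directories that Hadoop components load JARs from
hadoop classpath
# copy the JAR into one of those directories, e.g. the Hadoop common lib directory
cp /usr/local/lib/juicefs-hadoop-4.5.4.jar $HADOOP_HOME/share/hadoop/common/lib/juicefs-hadoop.jar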

JuiceFS Java Client Parameter Reference

Add the following parameters to the core-site.xml of Hadoop.

Basic Configuration

Item | Default Value | Description
fs.jfs.impl | com.juicefs.JuiceFileSystem | Specify the implementation of the jfs:// storage type.
fs.AbstractFileSystem.jfs.impl | com.juicefs.JuiceFS | Specify the implementation of AbstractFileSystem for MapReduce.
juicefs.token | | Credential for the JuiceFS volume; find it on the settings page of the JuiceFS web console.
juicefs.accesskey | | Access Key for the object store. Not needed if the node already has the privilege to access the object store.
juicefs.secretkey | | Secret Key for the object store. Not needed if the node already has the privilege to access the object store.

Cache Configuration

Item | Default Value | Description
juicefs.cache-dir | /var/jfsCache | Cache directory. You can specify more than one folder, separated by colons, or use wildcards (*). Typically the components do not have permission to create these directories, so you need to create them manually and grant 0777 permission so that the components can share the cached data (see the example after this table).
juicefs.cache-size | 10240 | Cache capacity in MB. If multiple directories are configured, this is the total capacity of all cache folders.
juicefs.cache-replica | 1 | Number of cache replicas.
juicefs.memory-size | 300 | Maximum memory used for read-ahead, in MB.
juicefs.auto-create-cache-dir | true | Whether to create cache directories automatically. When set to false, cache directories that do not exist are ignored.
juicefs.free-space | 0.2 | Minimum free space ratio. When free space falls below this ratio, the cache is cleaned to free disk space; the default is 20%.
juicefs.metacache | true | Enable metadata caching.
juicefs.discover-nodes-url | | Specify the node list of the cluster, refreshed every 10 minutes. If you use YARN to manage resources, nodes can be discovered through the RESTful API (multiple URLs separated by commas): http://resourcemanager1:8088/ws/v1/cluster/nodes/. You can also give the path of a configuration file, e.g. file:///etc/nodes or jfs://your-jfs-name/etc/nodes; the file must be plain text, one hostname per line.
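
For instance, with two data disks the cache directories could be prepared like this before setting juicefs.cache-dir to /data01/jfscache:/data02/jfscache (the disk paths are placeholders):

mkdir -p /data01/jfscache /data02/jfscache
chmod 0777 /data01/jfscache /data02/jfscache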

Other Configuration

Item | Default Value | Description
juicefs.access-log | | File path for the access log (e.g. /tmp/juicefs.access.log); read and write permission is required for every Hadoop component. The log file rotates automatically and retains the last 7 files.
juicefs.debug | false | Enable DEBUG level logging.
juicefs.max-uploads | 50 | Maximum concurrency for writing to the object store.
juicefs.superuser | hdfs | Specify the superuser.
juicefs.bucket | | Specify the bucket name of the object store.
juicefs.upload-limit | 0 | Speed limit for a single process writing to the object store, in bytes/second.
juicefs.object-timeout | 5 | Timeout for accessing the object store.
juicefs.external | false | Use the external domain to access the object store.
juicefs.rsaPrivKeyPath | | File path of the RSA private key for data encryption.
juicefs.rsaPassphase | | Passphrase of the RSA key for data encryption.
juicefs.file.checksum | false | Enable checksums when copying data via Hadoop DistCp.

When using multiple JuiceFS file systems, all of the above items can be specified for a single file system by placing the file system name JFS_NAME in the middle of the configuration item, for example:

<property>
  <name>juicefs.{JFS_NAME}.debug</name>
  <value>false</value>
</property>

Configuration Guide for Hadoop Distributions

If you have completed Automatically Install JuiceFS Java Client, the following configuration is not required.

Cloudera Distribution Hadoop (CDH)

Follow Manually Install JuiceFS Java Client, and then change the configuration of the components:

  • Hadoop

    CDH 5.x

    Modify core-site.xml in the ClouderaManager:

    Configuration details can be found in the Parameter Reference.

    CDH 6.x

    In addition to the steps for the 5.x version, you also need to add the following path to mapreduce.application.classpath in the YARN service interface:

    $HADOOP_COMMON_HOME/lib/juicefs-hadoop.jar
  • HBase

    Modify hbase-site.xml at HBase service interface:

    Add the following lines:

    <property>
      <name>hbase.rootdir</name>
      <value>jfs://{JFS_NAME}/hbase</value>
    </property>
    <property>
      <name>hbase.wal.dir</name>
      <value>hdfs://your-hdfs-uri/hbase-wal</value>
    </property>

    Delete the znode (default /hbase) configured by zookeeper.znode.parent in ZooKeeper.

    Notice: this operation will empty the HBase data.

  • Hive

    Modify hive.metastore.warehouse.dir in the Hive service interface to change the default location for table creation in Hive (optional):

    jfs://your-jfs-name/your-warehouse-dir
  • Impala

    Modify Impala Commandline Advanced Configuration at Impala service interface.

    To increase the number of I/O threads, you can set the following parameter to 20 / number-of-mounted-disks.

    -num_io_threads_per_rotational_disk=4
  • Solr

    Modify the Solr Advanced Configuration Code Snippets in the Solr service interface:

    hdfs_data_dir=jfs://your-jfs/solr

Finally, restart the cluster for the changes to take effect.

Hortonworks Data Platform (HDP)

Follow the instructions in Manually Install JuiceFS Java Client, and then modify the configuration in Ambari:

  • Hadoop

    Modify core-site.xml in the HDFS interface, following the Parameter Reference.

  • MapReduce2

    Modify mapreduce.application.classpath in the MapReduce2 interface, appending :/usr/hdp/${hdp.version}/hadoop/lib/juicefs-hadoop.jar at the end (do not replace the variable).

  • Hive

    Modify hive.metastore.warehouse.dir in the Hive interface to change the default location for table creation (optional):

    jfs://your-jfs-name/your-warehouse-dir

    If you have the Ranger service, please append jfs: to the end of ranger.plugin.hive.urlauth.filesystem.schemes:

    ranger.plugin.hive.urlauth.filesystem.schemes=hdfs:,file:,wasb:,adl:,jfs:
  • Druid

    Change the directory addresses in the Druid interface (you may need to create the directories manually due to permission limits):

    "druid.storage.storageDirectory": "jfs://your-jfs-name/apps/druid/warehouse"
    "druid.indexer.logs.directory": "jfs://your-jfs-name/user/druid/logs"
  • HBase

    Modify these parameters in the HBase interface:

    hbase.rootdir=jfs://your-jfs-name/hbase
    hbase.wal.dir=hdfs://your-hdfs-uri/hbase-wal

    Delete the znode (default /hbase) configured by zookeeper.znode.parent in ZooKeeper.

    Notice: this operation will empty the HBase data.

  • Sqoop

    When importing data into Hive with Sqoop, Sqoop first imports the data into target-dir and then loads it into the Hive table with a Hive load statement, so you need to modify target-dir when using Sqoop.

    1.4.6

    For version 1.4.6, you need to pass the -fs option to change the default file system. Please copy mapreduce.tar.gz from HDFS to the same path on JuiceFS; the default path is /hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz.

    sqoop import \
    -fs jfs://your-jfs-name/ \
    --target-dir jfs://your-jfs-name/tmp/your-dir

    1.4.7

    sqoop import \
    --target-dir jfs://your-jfs-name/tmp/your-dir

Finally, restart the services for the changes to take effect.

Aliyun EMR

Follow Manually Install JuiceFS Java Client to set up the JAR file, and modify the configuration in the EMR interface.

  • Hadoop

    Modify core-site.xml in the HDFS interface, following the Parameter Reference.

  • Hive

    Modify hive_aux_jars_path in hive-env at Hive interface:

    hive_aux_jars_path=/usr/local/lib/juicefs-hadoop.jar

    Modify hive.metastore.warehouse.dir in the Hive interface to change the default location for table creation (optional):

    jfs://your-jfs-name/your-warehouse-dir

Finally, restart the services for the changes to take effect.

Tencent Cloud EMR

Follow Manually Install JuiceFS Java Client to set up the JAR file, and modify the configuration in the EMR interface.

  • Hadoop

    Modify core-site.xml in the HDFS interface, following the Parameter Reference.

    Caution:

    1. The format of juicefs.bucket is juicefs-{your_jfs_name}-{APPID}; you can find the APPID in the account information of the Tencent Cloud console.
    2. Specify juicefs.superuser=hadoop, because Tencent Cloud EMR runs HDFS as the hadoop user by default.
  • Hive

    Modify hive.metastore.warehouse.dir in the Hive interface to change the default location for table creation (optional):

    jfs://your-jfs-name/your-warehouse-dir

Finally, restart the services for the changes to take effect.

Kingsoft Cloud KMR

Please set it up the same way as the Hortonworks Data Platform (HDP) configuration.

Amazon Web Services (AWS) EMR

Follow Manually Install JuiceFS Java Client to set up the JAR file, and modify the configuration in the EMR interface.

  • Hadoop

    Modify core-site.xml in the HDFS interface, following the Parameter Reference.

  • HBase

    Modify hbase-site

    "hbase.rootdir": "jfs://your-jfs/hbase"

    Modify hbase

    "hbase.emr.storageMode": "jfs"

After the EMR configuration is refreshed, restart the services that you changed.

UCloud UHadoop

Follow Manually Install JuiceFS Java Client to set up the JAR file, and modify the configuration in the cluster management interface.

Modify core-site.xml following the Parameter Reference.

Finally, restart the cluster for the changes to take effect.

Other Components

Deploy the JuiceFS Java client in the Hadoop environment first.

Apache Spark

Modify conf/spark-env.sh, adding the following lines at the top:

export HADOOP_HOME=YOUR_HADOOP_HOME
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_DIST_CLASSPATH=`hadoop classpath`

Apache Flink

Modify bin/config.sh, adding the following lines at the top:

export HADOOP_HOME=YOUR_HADOOP_HOME
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_CLASSPATH=`hadoop classpath`

Presto

Copy juicefs-hadoop.jar to the plugin/hive-hadoop2 folder of Presto.
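
For example, assuming Presto is installed under /usr/lib/presto and the JAR was downloaded to /usr/local/lib (both paths are assumptions), the copy on each Presto node would be:

cp /usr/local/lib/juicefs-hadoop.jar /usr/lib/presto/plugin/hive-hadoop2/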

DataX

  • Copy juicefs-hadoop.jar to datax/plugin/writer/hdfswriter/libs folder.

  • Modify the configuration file of DataX:

    "defaultFS": "jfs://your-jfs-name",
    "hadoopConfig": {
    "fs.jfs.impl": "com.juicefs.JuiceFileSystem",
    "fs.AbstractFileSystem.jfs.impl": "com.juicefs.JuiceFS",
    "juicefs.access-log": "/tmp/juicefs.access.log",
    "juicefs.token": "xxxxxxxxxxxxx",
    "juicefs.accesskey": "xxxxxxxxxxxxx",
    "juicefs.secretkey": "xxxxxxxxxxxxx"
    }

Deployment Validation

Hadoop

hadoop fs -ls jfs://${JFS_NAME}/
hadoop fs -mkdir jfs://${JFS_NAME}/jfs-test
hadoop fs -rm -r jfs://${JFS_NAME}/jfs-test

Hive, SparkSQL, Impala

create table if not exists person(
  name string,
  age int
)
location 'jfs://${JFS_NAME}/tmp/person';

insert into table person values('tom',25);
insert overwrite table person select name, age from person;
select name, age from person;
drop table person;

HBase

create 'test', 'cf'
list 'test'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
get 'test', 'row1'
disable 'test'
drop 'test'

Flume

jfs.sources =r1
jfs.sources.r1.type = org.apache.flume.source.StressSource
jfs.sources.r1.size = 10240
jfs.sources.r1.maxTotalEvents=10
jfs.sources.r1.batchSize=10
jfs.sources.r1.channels = c1

jfs.channels = c1
jfs.channels.c1.type = memory
jfs.channels.c1.capacity = 100
jfs.channels.c1.transactionCapacity = 100

jfs.sinks = k1
jfs.sinks.k1.type = hdfs
jfs.sinks.k1.channel = c1
jfs.sinks.k1.hdfs.path =jfs://${JFS_NAME}/tmp/flume
jfs.sinks.k1.hdfs.writeFormat= Text
jfs.sinks.k1.hdfs.fileType= DataStream

Flink

echo 'hello world' > /tmp/jfs_test
hadoop fs -put /tmp/jfs_test jfs://${JFS_NAME}/tmp/
rm -f /tmp/jfs_test
./bin/flink run -m yarn-cluster ./examples/batch/WordCount.jar --input jfs://${JFS_NAME}/tmp/jfs_test --output jfs://${JFS_NAME}/tmp/result

Migrate data to JuiceFS

Hive, Spark, Impala

Data in a data warehouse such as Hive can be stored on different storage systems. You can copy data from HDFS to JuiceFS via distcp and update the data location of the table with alter table.

Different partitions of the same table can also be stored on different storage systems. You can change your program to write new partitions to JuiceFS, copy existing partitions to JuiceFS via distcp, and finally update the partition locations (see the sketch below).
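
A rough sketch of this flow, assuming a Hive table db.your_table whose data lives under /user/hive/warehouse (the table name, paths, and HiveServer2 address are all placeholders):

# copy the existing data from HDFS to JuiceFS
hadoop distcp hdfs://your-namenode/user/hive/warehouse/db.db/your_table \
  jfs://your-jfs-name/user/hive/warehouse/db.db/your_table

# point the table (or a single partition) at the new location
beeline -u jdbc:hive2://your-hiveserver:10000 \
  -e "ALTER TABLE db.your_table SET LOCATION 'jfs://your-jfs-name/user/hive/warehouse/db.db/your_table'"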

HBase

Preparation

Assume you have a new HBase cluster with the same configuration.

Offline Migration

Disable all tables in the original HBase cluster:

bin/disable_all_tables.sh

In the New JuiceFS HBase Cluster:

  • Copy data via distcp:

    sudo -u hbase hadoop distcp -Dmapreduce.map.memory.mb=1500 your_filesystem/hbase/data jfs://your_jfs/hbase/data
  • Remove the Metadata of HBase

    sudo -u hbase hadoop fs -rm -r jfs://your_jfs/hbase/data/hbase
  • Repair the Metadata

    sudo -u hbase hbase hbck -fixMeta -fixAssignments

Online Migration

The next step is to import the historical data into the new JuiceFS HBase cluster and keep it in sync with the original HBase cluster. Once the data is synchronized, the business can switch to the new JuiceFS HBase cluster.

In the Original HBase Cluster:

  • Turn on the replication:

    add_peer '1', CLUSTER_KEY => "jfs_zookeeper:2181:/hbase",TABLE_CFS => { "your_table" => []}
    enable_table_replication 'your_table'

    Tables will be created automatically (including splits).

  • Turn off the replication:

    disable_peer("1")
  • Create a snapshot:

    snapshot 'your_table','your_snapshot'

In the New JuiceFS HBase Cluster:

  • Import the snapshot:

    sudo -u hbase hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot your_snapshot \
    -copy-from s3://your_s3/hbase \
    -copy-to jfs://your_jfs/hbase \
    -mappers 1 \
    -bandwidth 20

    You can adjust the number of mappers and the bandwidth (in MB/s).

  • Restore the snapshot:

    disable 'your_table'
    restore_snapshot 'your_snapshot'
    enable 'your_table'

Turn the replication back on in the original HBase cluster:

enable_peer("1")