
Installation

  1. Network requirements:
    • For JuiceFS Cloud Service, the Hadoop SDK needs to access the metadata service and the console over the public internet, so you'll probably need to deploy a NAT gateway to give the Hadoop cluster public internet access.
    • For JuiceFS on-premises deployments, the metadata service and the console are already deployed inside the VPC, so the Hadoop nodes only need network access to them.
  2. The JuiceFS Hadoop Java SDK does not create the object storage bucket automatically; create it in advance.
  3. Download the latest juicefs-hadoop.jar; this package will be used in later steps.
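
For example, a minimal sketch of staging and sanity-checking the JAR on one node (the download URL and the /tmp staging path are placeholders for your environment):

# Placeholder URL: use the actual download link from the JuiceFS console or documentation.
JAR_URL="https://example.com/juicefs-hadoop.jar"
curl -fsSL -o /tmp/juicefs-hadoop.jar "$JAR_URL"
# Confirm the expected classes are present before distributing the JAR.
unzip -l /tmp/juicefs-hadoop.jar | grep -c 'com/juicefs/JuiceFileSystem'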

Common Hadoop Distribution

If you're already using one of these common Hadoop distributions, follow the installation steps below.

Cloudera Distribution Hadoop (CDH)

Install Using Parcel

  1. On the Cloudera Manager node, download the Parcel and CSD packages, put the CSD into /opt/cloudera/csd, and extract the Parcel into /opt/cloudera/parcel-repo (a shell sketch follows this list).

  2. Restart Cloudera Manager

    service cloudera-scm-server restart
  3. Activate Parcel

    Open Cloudera Manager Admin Console → Hosts → Check for New Parcels → JuiceFS → Distribute → Activate

  4. Add Service

    Open Cloudera Manager Admin Console → Cluster → Add Service → JuiceFS → Choose node

  5. Deploy JAR File

  6. Upgrade

    To upgrade, download the latest Parcel using the steps above, then activate it as in step 3.
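
As a rough sketch of step 1, the commands on the Cloudera Manager node might look like the following; file names are hypothetical, so substitute the CSD and Parcel files you actually downloaded:

# Hypothetical file names; adjust to the versions you downloaded.
cp JUICEFS-{version}.jar /opt/cloudera/csd/
tar -xf juicefs-parcel-{version}.tar.gz -C /opt/cloudera/parcel-repo/
# Adjust ownership if Cloudera Manager runs as the cloudera-scm user.
chown cloudera-scm:cloudera-scm /opt/cloudera/csd/* /opt/cloudera/parcel-repo/*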

Configure in Cloudera Manager

  • Hadoop

    CDH 5.x

    Modify core-site.xml in Cloudera Manager:

    Example:

    fs.jfs.impl=com.juicefs.JuiceFileSystem
    fs.AbstractFileSystem.jfs.impl=com.juicefs.JuiceFS
    juicefs.cache-size=10240
    juicefs.cache-dir=xxxxxx
    juicefs.cache-group=yarn
    juicefs.discover-nodes-url=yarn
    juicefs.accesskey=xxxxxx
    juicefs.secretkey=xxxxxx
    juicefs.token=xxxxxx
    juicefs.access-log=/tmp/juicefs.access.log

    CDH 6.x

    Follow the same steps as for 5.x; in addition, add the following path to mapreduce.application.classpath at the YARN service interface:

    $HADOOP_COMMON_HOME/lib/juicefs-hadoop.jar
  • HBase

    Modify hbase-site.xml at the HBase service interface:

    Add the following lines:

    <property>
    <name>hbase.rootdir</name>
    <value>jfs://{VOL_NAME}/hbase</value>
    </property>
    <property>
    <name>hbase.wal.dir</name>
    <value>hdfs://your-hdfs-uri/hbase-wal</value>
    </property>

    Delete the znode configured by zookeeper.znode.parent (default /hbase). Note that this operation deletes all data in the original HBase cluster.

  • Hive

    Optionally modify hive.metastore.warehouse.dir at the Hive service interface to change the default location for new tables:

    jfs://myjfs/your-warehouse-dir
  • Impala

    Modify the Impala Command Line Advanced Configuration at the Impala service web UI.

    To increase the number of I/O threads, set the following parameter; a reasonable value is 20 divided by the number of mounted disks:

    -num_io_threads_per_rotational_disk=4
  • Solr

    Modify the Solr Advanced Configuration Code Snippets at the Solr service web UI:

    hdfs_data_dir=jfs://myjfs/solr

Finally, restart the cluster for the changes to take effect.

Hortonworks Data Platform (HDP)

Integrate JuiceFS into Ambari

  1. Download the JuiceFS integration package for HDP and unzip it to /var/lib/ambari-server/resources/stacks/HDP/{YOUR-HDP-VERSION}/services (a shell sketch follows this list)

  2. Restart Ambari

    systemctl restart ambari-server
  3. Add JuiceFS Service

    Ambari Management Console → Services → Add Service → JuiceFS → Choose node → Configure → Deploy

    In the Configure step, pay attention to cache_dirs and download_url and adjust them to your environment.

    If Ambari has no internet access, put the downloaded JAR file into share_download_dir, which defaults to the /tmp directory on HDFS.

  4. Upgrade JuiceFS

    Change the version string in download_url, then save and refresh the configuration.
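
A minimal sketch of steps 1–2, assuming the integration package is an archive you have already downloaded (the archive name and {YOUR-HDP-VERSION} are placeholders):

# Placeholders: adjust the archive name and your HDP version.
unzip juicefs-hdp-{version}.zip -d /var/lib/ambari-server/resources/stacks/HDP/{YOUR-HDP-VERSION}/services
systemctl restart ambari-server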

Configure in Ambari

  • Hadoop

    Modify core-site.xml at the HDFS service interface; see Configuration.

  • MapReduce2

    At the MapReduce2 service interface, append the literal string :/usr/hdp/${hdp.version}/hadoop/lib/juicefs-hadoop.jar (including the leading colon) to mapreduce.application.classpath.

  • Hive

    Optionally modify hive.metastore.warehouse.dir at the Hive service interface to change the default location for new tables:

    jfs://myjfs/your-warehouse-dir

    If Ranger is in use, append jfs: to the end of ranger.plugin.hive.urlauth.filesystem.schemes:

    ranger.plugin.hive.urlauth.filesystem.schemes=hdfs:,file:,wasb:,adl:,jfs:
  • Druid

    Change the storage directories at the Druid service interface (you may need to create them manually due to permission restrictions):

    "druid.storage.storageDirectory": "jfs://myjfs/apps/druid/warehouse"
    "druid.indexer.logs.directory": "jfs://myjfs/user/druid/logs"
  • HBase

    Modify these parameters at the HBase service interface:

    hbase.rootdir=jfs://myjfs/hbase
    hbase.wal.dir=hdfs://your-hdfs-uri/hbase-wal

    Delete the znode configured by zookeeper.znode.parent (default /hbase). Note that this operation deletes all data in the original HBase cluster.

  • Sqoop

    When importing data into Hive with Sqoop, Sqoop first writes the data into target-dir and then loads it into the Hive table with hive load, so you need to point target-dir at JuiceFS.

    • 1.4.6

      For version 1.4.6, set the fs option to make JuiceFS the default file system. You also need to copy mapreduce.tar.gz from HDFS to the same path on JuiceFS; the default path is /hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz.

      sqoop import \
      -fs jfs://myjfs/ \
      --target-dir jfs://myjfs/tmp/your-dir
    • 1.4.7

      sqoop import \
      --target-dir jfs://myjfs/tmp/your-dir

Finally, restart the affected services for the changes to take effect.

Amazon Web Services (AWS) EMR

Integrate JuiceFS in a new EMR Cluster

  1. Fill in the configuration in the Software and Steps UI

    • Basic config

      [
      {
      "classification": "core-site",
      "properties": {
      "fs.jfs.impl": "com.juicefs.JuiceFileSystem",
      "fs.AbstractFileSystem.jfs.impl": "com.juicefs.JuiceFS",
      "juicefs.cache-size": "10240",
      "juicefs.access-log": "/tmp/juicefs.access.log",
      "juicefs.discover-nodes-url": "yarn",
      "juicefs.conf-dir": "/etc/juicefs",
      "juicefs.cache-full-block": "false",
      "juicefs.token": "",
      "juicefs.cache-group": "yarn",
      "juicefs.cache-dir": "/mnt*/jfs"
      }
      }
      ]
    • HBase on JuiceFS

      Change juicefs.cache-size and juicefs.free-space to suit your actual environment. juicefs.token can be obtained in the JuiceFS web console.

      [
      {
      "classification": "core-site",
      "properties": {
      "fs.jfs.impl": "com.juicefs.JuiceFileSystem",
      "juicefs.cache-size": "10240",
      "juicefs.access-log": "/tmp/juicefs.access.log",
      "juicefs.discover-nodes-url": "yarn",
      "juicefs.conf-dir": "/etc/juicefs",
      "juicefs.cache-full-block": "false",
      "juicefs.token": "",
      "juicefs.cache-group": "yarn",
      "juicefs.free-space": "0.3",
      "juicefs.cache-dir": "/mnt*/jfs",
      "fs.AbstractFileSystem.jfs.impl": "com.juicefs.JuiceFS"
      }
      },
      {
      "classification": "hbase-site",
      "properties": {
      "hbase.rootdir": "jfs: //{name}/hbase"
      }
      }
      ]
  2. Create bootstrap actions in General Cluster Settings

    Download emr-boot.sh and juicefs-hadoop.jar, and upload them to S3.

    Fill in the S3 location of emr-boot.sh as the script location, and pass the S3 address of juicefs-hadoop-{version}.jar as the argument.

    • With access to public internet

      --jar s3://{bucket}/resources/juicefs-hadoop-{version}.jar
    • Offline (On-premise)

      The SDK needs to download the JuiceFS config file to discover the metadata service address, so in an offline environment you need to prepare this file manually. It can be obtained from any host with JuiceFS mounted, at /root/.juicefs/$VOL_NAME.conf.

      Upload this file to S3 and pass the following arguments to the bootstrap action:

      --jar s3://{bucket}/{jar-path}
      --conf-file s3://{bucket}/{conf-file-path}
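
A minimal sketch of preparing the bootstrap files (bucket, key names, and versions are placeholders); the bootstrap action itself is then configured in the EMR console as described above, or attached from the CLI:

# Placeholders: bucket, key names and versions are examples only.
aws s3 cp emr-boot.sh s3://my-bucket/resources/emr-boot.sh
aws s3 cp juicefs-hadoop-{version}.jar s3://my-bucket/resources/juicefs-hadoop-{version}.jar
# Offline clusters also need the volume config file described above:
aws s3 cp myjfs.conf s3://my-bucket/resources/myjfs.conf

# If you create the cluster from the CLI instead of the console, the bootstrap
# action can be attached roughly like this (other required options omitted):
aws emr create-cluster ... \
  --bootstrap-actions 'Path=s3://my-bucket/resources/emr-boot.sh,Args=[--jar,s3://my-bucket/resources/juicefs-hadoop-{version}.jar]'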

Integrate JuiceFS in an existing EMR cluster

On the EMR master node, install the Hadoop SDK, then modify the configuration in the EMR web UI:

  • Hadoop

    Modify core-site.xml in the HDFS service UI; common settings:

    fs.jfs.impl=com.juicefs.JuiceFileSystem
    fs.AbstractFileSystem.jfs.impl=com.juicefs.JuiceFS
    juicefs.cache-size=10240
    juicefs.cache-dir=xxxxxx
    juicefs.cache-group=yarn
    juicefs.discover-nodes-url=yarn
    juicefs.accesskey=xxxxxx
    juicefs.secretkey=xxxxxx
    juicefs.token=xxxxxx
    juicefs.access-log=/tmp/juicefs.access.log
  • HBase

    Modify the hbase-site classification:

    "hbase.rootdir": "jfs://myjfs/hbase"

    Modify the hbase classification:

    "hbase.emr.storageMode": "jfs"

Wait for the EMR configuration to refresh, then restart the affected services.

Hadoop applications

To install the Hadoop SDK for specific applications, put the juicefs-hadoop.jar (downloaded in previous steps) and $JAVA_HOME/lib/tools.jar into each application's installation directory listed in the table below (a copy sketch follows the table).

Name     Installation directory
Hadoop   ${HADOOP_HOME}/share/hadoop/common/lib/
Hive     ${HIVE_HOME}/auxlib
Spark    ${SPARK_HOME}/jars
Presto   ${PRESTO_HOME}/plugin/hive-hadoop2
Flink    ${FLINK_HOME}/lib
DataX    ${DATAX_HOME}/plugin/writer/hdfswriter/libs
         ${DATAX_HOME}/plugin/reader/hdfsreader/libs/
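
A minimal copy sketch, assuming the environment variables from the table are set and juicefs-hadoop.jar sits in the current directory:

# Copy the SDK JAR and tools.jar into each application's directory from the table.
for d in \
  "${HADOOP_HOME}/share/hadoop/common/lib" \
  "${HIVE_HOME}/auxlib" \
  "${SPARK_HOME}/jars" \
  "${PRESTO_HOME}/plugin/hive-hadoop2" \
  "${FLINK_HOME}/lib" \
  "${DATAX_HOME}/plugin/writer/hdfswriter/libs" \
  "${DATAX_HOME}/plugin/reader/hdfsreader/libs"
do
  cp juicefs-hadoop.jar "${JAVA_HOME}/lib/tools.jar" "$d/"
done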

Apache Spark

  1. Install juicefs-hadoop.jar

  2. Add the JuiceFS configuration, which can be done by any of the following:

    • Modifying core-site.xml

    • Passing the following arguments on the command line:

      spark-shell --master local[*] \
      --conf spark.hadoop.fs.jfs.impl=com.juicefs.JuiceFileSystem \
      --conf spark.hadoop.fs.AbstractFileSystem.jfs.impl=com.juicefs.JuiceFS \
      --conf spark.hadoop.juicefs.token=xxx \
      --conf spark.hadoop.juicefs.accesskey=xxx \
      --conf spark.hadoop.juicefs.secretkey=xxx \
      ...
    • Modifying $SPARK_HOME/conf/spark-defaults.conf
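
A minimal sketch of the spark-defaults.conf route, mirroring the command-line properties above (token and keys are placeholders):

cat >> "$SPARK_HOME/conf/spark-defaults.conf" <<'EOF'
spark.hadoop.fs.jfs.impl                    com.juicefs.JuiceFileSystem
spark.hadoop.fs.AbstractFileSystem.jfs.impl com.juicefs.JuiceFS
spark.hadoop.juicefs.token                  xxx
spark.hadoop.juicefs.accesskey              xxx
spark.hadoop.juicefs.secretkey              xxx
EOF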

Apache Flink

  1. Install juicefs-hadoop.jar

  2. Add the JuiceFS configuration, which can be done by either:

    • Modifying core-site.xml
    • Modifying flink-conf.yaml
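
A minimal sketch of the flink-conf.yaml route, assuming your Flink version forwards flink.hadoop.-prefixed keys to the Hadoop configuration (token and keys are placeholders):

cat >> "$FLINK_HOME/conf/flink-conf.yaml" <<'EOF'
flink.hadoop.fs.jfs.impl: com.juicefs.JuiceFileSystem
flink.hadoop.fs.AbstractFileSystem.jfs.impl: com.juicefs.JuiceFS
flink.hadoop.juicefs.token: xxx
flink.hadoop.juicefs.accesskey: xxx
flink.hadoop.juicefs.secretkey: xxx
EOF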

Presto

  1. Install juicefs-hadoop.jar
  2. Add the JuiceFS configuration by modifying core-site.xml

DataX

  1. Install juicefs-hadoop.jar

  2. Add the JuiceFS configuration by modifying the DataX job configuration file:

    "defaultFS": "jfs://myjfs",
    "hadoopConfig": {
    "fs.jfs.impl": "com.juicefs.JuiceFileSystem",
    "fs.AbstractFileSystem.jfs.impl": "com.juicefs.JuiceFS",
    "juicefs.access-log": "/tmp/juicefs.access.log",
    "juicefs.token": "xxx",
    "juicefs.accesskey": "xxx",
    "juicefs.secretkey": "xxx"
    }

Basic configurations

This section is currently only available in Chinese.

Verification

Use the steps below to verify that JuiceFS is working properly under Hadoop.

Hadoop

hadoop fs -ls jfs://${VOL_NAME}/
hadoop fs -mkdir jfs://${VOL_NAME}/jfs-test
hadoop fs -rm -r jfs://${VOL_NAME}/jfs-test

Hive, SparkSQL, Impala

create table if not exists person(
  name string,
  age int
)
location 'jfs://${VOL_NAME}/tmp/person';
insert into table person values('tom',25);
insert overwrite table person select name, age from person;
select name, age from person;
drop table person;

Spark Shell

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
val conf = sc.hadoopConfiguration
val p = new Path("jfs://${VOL_NAME}/")
val fs = p.getFileSystem(conf)
fs.listStatus(p)

HBase

create 'test', 'cf'
list 'test'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
get 'test', 'row1'
disable 'test'
drop 'test'

Flume

jfs.sources = r1
jfs.sources.r1.type = org.apache.flume.source.StressSource
jfs.sources.r1.size = 10240
jfs.sources.r1.maxTotalEvents = 10
jfs.sources.r1.batchSize = 10
jfs.sources.r1.channels = c1

jfs.channels = c1
jfs.channels.c1.type = memory
jfs.channels.c1.capacity = 100
jfs.channels.c1.transactionCapacity = 100

jfs.sinks = k1
jfs.sinks.k1.type = hdfs
jfs.sinks.k1.channel = c1
jfs.sinks.k1.hdfs.path = jfs://${VOL_NAME}/tmp/flume
jfs.sinks.k1.hdfs.writeFormat = Text
jfs.sinks.k1.hdfs.fileType = DataStream

Flink

echo 'hello world' > /tmp/jfs_test
hadoop fs -put /tmp/jfs_test jfs://${VOL_NAME}/tmp/
rm -f /tmp/jfs_test
./bin/flink run -m yarn-cluster ./examples/batch/WordCount.jar --input jfs://${VOL_NAME}/tmp/jfs_test --output jfs://${VOL_NAME}/tmp/result

Restart relevant services

After the Hadoop SDK has been installed or upgraded, the relevant services need to be restarted for the changes to take effect.

  • Services that don't require a restart:

    • HDFS
    • HUE
    • ZooKeeper
  • Restart the following services if they use JuiceFS; if HA is configured, you can perform a rolling restart to avoid downtime.

    • YARN
    • Hive Metastore Server, HiveServer2
    • Spark ThriftServer
    • Spark Standalone
      • Master
      • Worker
    • Presto
      • Coordinator
      • Worker
    • Impala
      • Catalog Server
      • Daemon
    • HBase
      • Master
      • RegionServer
    • Flume

Best Practices

Hadoop

Hadoop applications typically integrate with HDFS through the org.apache.hadoop.fs.FileSystem class; JuiceFS supports Hadoop by extending this class.

After the Hadoop SDK has been set up, you can manage JuiceFS with hadoop fs:

hadoop fs -ls jfs://{VOL_NAME}/

To drop the jfs:// scheme, set fs.defaultFS to the JuiceFS volume in core-site.xml:

<property>
<name>fs.defaultFS</name>
<value>jfs://{VOL_NAME}</value>
</property>
note

Changing fs.defaultFS makes every path without an explicit scheme resolve to JuiceFS; this might cause unexpected problems and should be tested thoroughly.
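
For example, once fs.defaultFS points at the volume, schemeless paths resolve to JuiceFS (the directory name below is arbitrary):

# Both commands now operate on jfs://{VOL_NAME}/tmp/defaultfs-test
hadoop fs -mkdir -p /tmp/defaultfs-test
hadoop fs -rm -r /tmp/defaultfs-test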

Hive

Use the LOCATION clause to specify the data location for a Hive database or table.

  • Creating database / table

    CREATE DATABASE ... database_name
    LOCATION 'jfs://{VOL_NAME}/path-to-database';

    CREATE TABLE ... table_name
    LOCATION 'jfs://{VOL_NAME}/path-to-table';

    A Hive table is by default stored under its database location, so if the database is already stored on JuiceFS, all tables within it also reside on JuiceFS.

  • Migrate database / table

    ALTER DATABASE database_name
    SET LOCATION 'jfs://{VOL_NAME}/path-to-database';

    ALTER TABLE table_name
    SET LOCATION 'jfs://{VOL_NAME}/path-to-table';

    ALTER TABLE table_name PARTITION(...)
    SET LOCATION 'jfs://{VOL_NAME}/path-to-partition';

    For Hive, data can be stored on multiple file systems. For an unpartitioned table, all data must be located on one file system; for a partitioned table, the file system can be configured per partition.

    To use JuiceFS as the default storage for Hive, set hive.metastore.warehouse.dir to a JuiceFS directory.
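
After changing a location (or the warehouse directory), you can confirm where a table lives straight from the metastore; a small sketch, with table_name as a placeholder:

# DESCRIBE FORMATTED prints a Location: field, which should now be a jfs:// path.
hive -e "DESCRIBE FORMATTED table_name;" | grep -i location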

Spark

Spark supports various deployment modes such as Standalone, YARN, Kubernetes, and Thrift Server.

A JuiceFS cache group needs stable client IPs to function properly, so pay special attention to whether executors have stable IPs under the running mode you use.

When the executor process is long-lived (as in Thrift Server mode) or runs on hosts with fixed IPs (as in Spark on YARN or Standalone), distributed cache is recommended, and juicefs.discover-nodes-url should be set accordingly.

If executors are ephemeral and host IPs change constantly (as in Spark on Kubernetes), a dedicated cache cluster is recommended; remember to set juicefs.discover-nodes-url to all.
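
As a sketch, the two recommendations above map to the following Spark properties (shown as command-line flags here; they can equally go into core-site.xml or spark-defaults.conf):

# On YARN (stable host IPs): distributed cache with YARN-based node discovery.
spark-submit --conf spark.hadoop.juicefs.discover-nodes-url=yarn ...

# Ephemeral executors (e.g. Spark on Kubernetes) joining a dedicated cache cluster:
spark-submit --conf spark.hadoop.juicefs.discover-nodes-url=all ...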

  • Spark shell

    scala> sc.textFile("jfs://{VOL_NAME}/path-to-input").count

HBase

HBase mainly stores two types of data: WAL files and HFiles. A write is first appended to the WAL (persisted with hflush) and then enters the RegionServer memstore, so if a RegionServer crashes, data can be recovered from the WAL. Once the memstore is flushed into HFiles on HDFS, the corresponding WAL files are deleted so they don't eat up storage space.

Due to hflush implementation differences, JuiceFS with juicefs.hflush set to sync cannot perform quite as well as HDFS, so it's recommended to keep WAL files on HDFS and store the final HFiles on JuiceFS.

<property>
<name>hbase.rootdir</name>
<value>jfs://{VOL_NAME}/hbase</value>
</property>
<property>
<name>hbase.wal.dir</name>
<value>hdfs://{NAME_SPACE}/hbase-wal</value>
</property>

Flink

When using the Streaming File Sink connector, you need to use a RollingPolicy together with checkpointing to ensure data consistency.

For Hadoop versions before 2.7, HDFS truncate is not available, so the only option is OnCheckpointRollingPolicy, which rolls a new file on every checkpoint and can result in a large number of small files.

For Hadoop 2.7 and later, use DefaultRollingPolicy, which rotates files based on file size, rollover interval, and inactivity interval.

JuiceFS implements truncate, so DefaultRollingPolicy is supported.

Flink can also integrate with different file systems via plugins; for JuiceFS, put juicefs-hadoop.jar into Flink's lib directory.

Flume

JuiceFS integrates with Flume via the HDFS sink. Due to hflush implementation differences (see hflush implementation), pay special attention to hdfs.batchSize, the setting that controls how many events are written between hflush calls.

To ensure data integrity, set juicefs.hflush to sync, and increase hdfs.batchSize so that each batch carries roughly 4 MB of data, which is the default block size that JuiceFS uploads to object storage.

Also, if compression is enabled (hdfs.fileType set to CompressedStream), files are prone to corruption from missing blocks when the object storage service fails; in that case even data previously committed with hflush may be lost. If this is a concern, disable compression by setting hdfs.fileType to DataStream and run compression in later ETL jobs.
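
A minimal sketch of the sink settings described above (the agent and sink names a1/k1 and the config file path are hypothetical; the right batchSize depends on your event size):

cat >> /etc/flume/conf/flume.conf <<'EOF'
# Uncompressed output; compress in later ETL jobs if needed.
a1.sinks.k1.hdfs.fileType = DataStream
# Raise the batch size so each hflush-ed batch carries roughly 4 MB of events.
a1.sinks.k1.hdfs.batchSize = 4096
EOF
# juicefs.hflush=sync is set in core-site.xml, not in the Flume agent config.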

Sqoop

When importing data into Hive with Sqoop, Sqoop first writes the data into target-dir and then loads it into the Hive table with hive load. So when using Sqoop, change target-dir to a JuiceFS path.

  • 1.4.6

    When using this specific version, set the fs parameter to make JuiceFS the default file system. You also need to copy mapreduce.tar.gz from HDFS to the same path on JuiceFS; the default path for this file is /hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz.

    sqoop import \
    -fs jfs://{VOL_NAME}/ \
    --target-dir jfs://{VOL_NAME}/path-to-dir
  • 1.4.7

    sqoop import \
    --target-dir jfs://{VOL_NAME}/path-to-dir