
Installation

Prerequisites

  1. Network requirements:
    • For JuiceFS Cloud Service, the Hadoop SDK needs to access the metadata service and the console over the public internet, so you'll probably need to deploy a NAT gateway to provide public internet access for the Hadoop cluster (see the connectivity check after this list).
    • For JuiceFS on-premise deployments, the metadata service and the console are already deployed inside the VPC, so the Hadoop nodes only need network access to them within the VPC.
  2. The JuiceFS Hadoop Java SDK does not create object storage buckets automatically; create them in advance.
  3. Download the latest juicefs-hadoop.jar; this package will be used in later steps.
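
A quick connectivity check might look like the sketch below. The console address assumes JuiceFS Cloud Service; for on-premise deployments, substitute your own console and metadata service addresses.

# Verify that a Hadoop node can reach the JuiceFS console
# (replace the URL with your on-premise console address if applicable)
curl -sI https://juicefs.com > /dev/null && echo "console reachable"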

Hadoop distribution

Cloudera Distribution Hadoop (CDH)

Install Using Parcel

  1. On the Cloudera Manager node, download the Parcel and CSD, put the CSD into /opt/cloudera/csd, and extract the Parcel into /opt/cloudera/parcel-repo (see the sketch after this list).

  2. Restart Cloudera Manager

    service cloudera-scm-server restart
  3. Activate Parcel

    Open Cloudera Manager Admin Console → Hosts → Check for New Parcels → JuiceFS → Distribute → Activate

  4. Add Service

    Open Cloudera Manager Admin Console → Cluster → Add Service → JuiceFS → Choose node

  5. Deploy JAR File

  6. Upgrade

    To upgrade, download the latest version of the Parcel file following the steps above, and then activate it as described in step 3.
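
For reference, steps 1-2 roughly correspond to the following shell sketch; the archive and file names are placeholders, not the actual names of the downloaded CSD and Parcel files.

# On the Cloudera Manager node (file names below are placeholders)
cp JUICEFS-*.jar /opt/cloudera/csd/
tar -xzf juicefs-parcel.tar.gz -C /opt/cloudera/parcel-repo/
service cloudera-scm-server restart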

Configure in Cloudera Manager

  • Hadoop

    • CDH 5.x

      Modify core-site.xml in the Cloudera Manager:

      Example:

      fs.jfs.impl=com.juicefs.JuiceFileSystem
      fs.AbstractFileSystem.jfs.impl=com.juicefs.JuiceFS
      juicefs.cache-size=10240
      juicefs.cache-dir=xxxxxx
      juicefs.cache-group=yarn
      juicefs.discover-nodes-url=yarn
      juicefs.accesskey=xxxxxx
      juicefs.secretkey=xxxxxx
      juicefs.token=xxxxxx
      juicefs.access-log=/tmp/juicefs.access.log
    • CDH 6.x

      Follow the steps above for the 5.x version. In addition, you also need to add the following path to mapreduce.application.classpath at the YARN service interface:

      $HADOOP_COMMON_HOME/lib/juicefs-hadoop.jar
  • HBase

    Modify hbase-site.xml at HBase service interface:

    Add the following lines:

    <property>
      <name>hbase.rootdir</name>
      <value>jfs://{VOL_NAME}/hbase</value>
    </property>
    <property>
      <name>hbase.wal.dir</name>
      <value>hdfs://your-hdfs-uri/hbase-wal</value>
    </property>

    Delete the znode configured by zookeeper.znode.parent (default /hbase) in ZooKeeper; note that this operation deletes all data in the original HBase cluster.

  • Hive

    Modify hive.metastore.warehouse.dir at the Hive service interface to change the default location for table creation in Hive (optional):

    jfs://myjfs/your-warehouse-dir
  • Impala

    Modify the Impala Command Line Advanced Configuration at the Impala service web UI.

    To increase the number of I/O threads, set the following parameter to 20 divided by the number of mounted disks (for example, 4 for a host with 5 data disks):

    -num_io_threads_per_rotational_disk=4
  • Solr

    Modify Solr Advanced Configuration Code Snippets at Solr service web UI:

    hdfs_data_dir=jfs://myjfs/solr

Finally, restart the cluster for the changes to take effect.

Hortonworks Data Platform (HDP)

Integrate JuiceFS to Ambari

  1. Download HDP and unzip it to /var/lib/ambari-server/resources/stacks/HDP/{YOUR-HDP-VERSION}/services

  2. Restart Ambari

    systemctl restart ambari-server
  3. Add JuiceFS Service

    Ambari Management Console → Services → Add Service → JuiceFS → Choose node → Configure → Deploy

    In the Configure step, pay attention to cache_dirs and download_url, and change them according to your environment.

    If Ambari has no internet access, put the downloaded JAR file into share_download_dir, which defaults to the HDFS /tmp directory (see the sketch after this list).

  4. Upgrade JuiceFS

    Change the version string in download_url, then save and refresh the configuration.

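If you staged the JAR in HDFS as mentioned in step 3, a minimal sketch (assuming share_download_dir keeps its default HDFS /tmp location) would be:

# Upload the SDK to HDFS so Ambari can fetch it without internet access
hadoop fs -put juicefs-hadoop.jar /tmp/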

Configure in Ambari

  • Hadoop

    Modify core-site.xml at the HDFS interface; see Configuration.

  • MapReduce2

    At the MapReduce2 interface, add the literal string :/usr/hdp/${hdp.version}/hadoop/lib/juicefs-hadoop.jar to mapreduce.application.classpath.

  • Hive

    Optionally modify hive.metastore.warehouse.dir at the Hive interface to change the default location for table creation:

    jfs://myjfs/your-warehouse-dir

    If Ranger is in use, append jfs: to the end of ranger.plugin.hive.urlauth.filesystem.schemes:

    ranger.plugin.hive.urlauth.filesystem.schemes=hdfs:,file:,wasb:,adl:,jfs:
  • Druid

    Change the directory addresses at the Druid interface (you may need to create the directories manually due to permission restrictions):

    "druid.storage.storageDirectory": "jfs://myjfs/apps/druid/warehouse"
    "druid.indexer.logs.directory": "jfs://myjfs/user/druid/logs"
  • HBase

    Modify these parameters at the HBase interface:

    hbase.rootdir=jfs://myjfs/hbase
    hbase.wal.dir=hdfs://your-hdfs-uri/hbase-wal

    Delete the znode configured by zookeeper.znode.parent (default /hbase) in ZooKeeper; note that this operation deletes all data in the original HBase cluster.

  • Sqoop

    When importing data into Hive with Sqoop, Sqoop first imports the data into target-dir and then loads it into the Hive table with hive load, so you need to modify target-dir when using Sqoop.

    • 1.4.6

      For version 1.4.6, you need to modify fs to change the default file system. You also need to copy mapreduce.tar.gz from HDFS to JuiceFS at the same path; the default path is /hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz.

      sqoop import \
      -fs jfs://myjfs/ \
      --target-dir jfs://myjfs/tmp/your-dir
    • 1.4.7

      sqoop import \
      --target-dir jfs://myjfs/tmp/your-dir

Finally, restart the services for the changes to take effect.

EMR platform

Integrating JuiceFS with an EMR platform is relatively easy and only takes two steps:

  • Add a bootstrap step to run the installation script, i.e. emr-boot.sh --jar s3://xxx/juicefs-hadoop.jar;
  • Modify core-site.xml and add JuiceFS related configs.

Existing EMR cluster

Common EMR platforms do not allow changing bootstrap configurations after the cluster has been created, which means you can't conveniently add bootstrap steps in their web console.

Thus, if you are dealing with an existing EMR cluster, you should use other methods to run the installation script whenever a new node joins the cluster.

Amazon EMR

  1. Fill in the configuration in the Software and Steps UI

    Fill in the JuiceFS related configurations; the format is as follows:

    [
      {
        "classification": "core-site",
        "properties": {
          "fs.jfs.impl": "com.juicefs.JuiceFileSystem",
          "fs.AbstractFileSystem.jfs.impl": "com.juicefs.JuiceFS",
          "juicefs.token": "",
          ...
        }
      }
    ]

    HBase on JuiceFS configuration:

    [
      {
        "classification": "core-site",
        "properties": {
          "fs.jfs.impl": "com.juicefs.JuiceFileSystem",
          "fs.AbstractFileSystem.jfs.impl": "com.juicefs.JuiceFS",
          ...
        }
      },
      {
        "classification": "hbase-site",
        "properties": {
          "hbase.rootdir": "jfs://{name}/hbase"
        }
      }
    ]
  2. Create bootstrap actions in General Cluster Settings

    Download emr-boot.sh and juicefs-hadoop.jar, and upload them to S3 (see the sketch after this list).

    Fill in the script location for emr-boot.sh, and use the S3 address of juicefs-hadoop-{version}.jar as the argument:

    --jar s3://{bucket}/resources/juicefs-hadoop-{version}.jar
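
Uploading the two files mentioned in step 2 might look like the following sketch; the bucket and prefix are placeholders.

# Upload the bootstrap script and the SDK to S3
aws s3 cp emr-boot.sh s3://{bucket}/resources/
aws s3 cp juicefs-hadoop-{version}.jar s3://{bucket}/resources/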

Specific Hadoop applications

If you would like to install the Hadoop SDK only for specific applications, put juicefs-hadoop.jar (downloaded in the previous steps) and $JAVA_HOME/lib/tools.jar into the application's installation directory, listed in the table below (see the sketch after the table).

Name      Installation directory
Hadoop    ${HADOOP_HOME}/share/hadoop/common/lib/
Hive      ${HIVE_HOME}/auxlib
Spark     ${SPARK_HOME}/jars
Presto    ${PRESTO_HOME}/plugin/hive-hadoop2
Flink     ${FLINK_HOME}/lib
DataX     ${DATAX_HOME}/plugin/writer/hdfswriter/libs
          ${DATAX_HOME}/plugin/reader/hdfsreader/libs
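
For example, installing the SDK for Spark and Hive from the table above could look like the following sketch (it assumes the environment variables are set and the directories exist; tools.jar is only present on JDK 8 and earlier).

# Copy the SDK and tools.jar into the application's library directory
cp juicefs-hadoop.jar ${JAVA_HOME}/lib/tools.jar ${SPARK_HOME}/jars/
cp juicefs-hadoop.jar ${JAVA_HOME}/lib/tools.jar ${HIVE_HOME}/auxlib/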

Apache Spark

  1. Install juicefs-hadoop.jar

  2. Add the JuiceFS configuration, which can be done by:

    • Modifying core-site.xml

    • Passing the arguments below on the command line:

      spark-shell --master local[*] \
      --conf spark.hadoop.fs.jfs.impl=com.juicefs.JuiceFileSystem \
      --conf spark.hadoop.fs.AbstractFileSystem.jfs.impl=com.juicefs.JuiceFS \
      --conf spark.hadoop.juicefs.token=xxx \
      --conf spark.hadoop.juicefs.accesskey=xxx \
      --conf spark.hadoop.juicefs.secretkey=xxx \
      ...
    • Modifying $SPARK_HOME/conf/spark-defaults.conf

Flink

  1. Install juicefs-hadoop.jar

  2. Add the JuiceFS configuration, which can be done by:

    • Modifying core-site.xml
    • Modifying flink-conf.yaml

Presto

  1. Install juicefs-hadoop.jar
  2. Add the JuiceFS configuration by modifying core-site.xml

DataX

  1. Install juicefs-hadoop.jar

  2. Add the JuiceFS configuration by modifying the DataX configuration file:

    "defaultFS": "jfs://myjfs",
    "hadoopConfig": {
    "fs.jfs.impl": "com.juicefs.JuiceFileSystem",
    "fs.AbstractFileSystem.jfs.impl": "com.juicefs.JuiceFS",
    "juicefs.access-log": "/tmp/juicefs.access.log",
    "juicefs.token": "xxx",
    "juicefs.accesskey": "xxx",
    "juicefs.secretkey": "xxx"
    }

HBase

Modify hbase-site:

"hbase.rootdir": "jfs://myjfs/hbase"

Modify hbase:

"hbase.emr.storageMode": "jfs"

Common configurations

After installation, you'll need to configure JuiceFS inside core-site.xml. Some of the more common settings are listed here; see their full description in Configuration.

Basic configurations

fs.jfs.impl=com.juicefs.JuiceFileSystem
fs.AbstractFileSystem.jfs.impl=com.juicefs.JuiceFS
juicefs.token=
juicefs.accesskey=
juicefs.secretkey=
# File system access logs, often used for performance troubleshooting, turn off to save disk space
juicefs.access-log=/tmp/juicefs.access.log
# Overwrite Web Console address in on-premise environments
juicefs.console-url=http://host:port

Cache configurations

To allow all Hadoop applications to access cached data, we recommend manually creating the cache directory in advance and setting its permission to 777. Learn more in Cache.
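
For example, a minimal sketch for preparing the cache directory (the path below is only an example and should match juicefs.cache-dir):

# Create the local cache directory with open permissions
mkdir -p /data/jfsCache
chmod 777 /data/jfsCache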

# Local cache directory
juicefs.cache-dir=
# Cache capacity in MiB
juicefs.cache-size=
# Declare cache group name, if distributed cache is enabled
juicefs.cache-group=
juicefs.discover-nodes-url=

For scenarios in which clients are constantly changing, consider using a dedicated cache cluster. Build one with the following example command:

juicefs mount \
--buffer-size=2000 \
--cache-group=xxx \
--cache-group-size=512 \
--cache-dir=xxx \
--cache-size=204800 \
${VOL_NAME} /jfs

Common configurations for dedicated cluster:

juicefs.cache-dir=
juicefs.cache-size=
juicefs.cache-group=xxx
juicefs.discover-nodes-url=all
# The computing nodes may change dynamically and do not need to join the cache group, so set juicefs.no-sharing to true.
juicefs.no-sharing=true

Verification

Use the steps below to verify that JuiceFS is working properly under Hadoop.

Hadoop

hadoop fs -ls jfs://${VOL_NAME}/
hadoop fs -mkdir jfs://${VOL_NAME}/jfs-test
hadoop fs -rm -r jfs://${VOL_NAME}/jfs-test

Hive, SparkSQL, Impala

create table if not exists person(
name string,
age int
)
location 'jfs://${VOL_NAME}/tmp/person';
insert into table person values('tom',25);
insert overwrite table person select name, age from person;
select name, age from person;
drop table person;

Spark Shell

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
val conf = sc.hadoopConfiguration
val p = new Path("jfs://${VOL_NAME}/")
val fs = p.getFileSystem(conf)
fs.listStatus(p)

HBase

create 'test', 'cf'
list 'test'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
get 'test', 'row1'
disable 'test'
drop 'test'

Flume

jfs.sources = r1
jfs.sources.r1.type = org.apache.flume.source.StressSource
jfs.sources.r1.size = 10240
jfs.sources.r1.maxTotalEvents = 10
jfs.sources.r1.batchSize = 10
jfs.sources.r1.channels = c1

jfs.channels = c1
jfs.channels.c1.type = memory
jfs.channels.c1.capacity = 100
jfs.channels.c1.transactionCapacity = 100

jfs.sinks = k1
jfs.sinks.k1.type = hdfs
jfs.sinks.k1.channel = c1
jfs.sinks.k1.hdfs.path = jfs://${VOL_NAME}/tmp/flume
jfs.sinks.k1.hdfs.writeFormat = Text
jfs.sinks.k1.hdfs.fileType = DataStream

Flink

echo 'hello world' > /tmp/jfs_test
hadoop fs -put /tmp/jfs_test jfs://${VOL_NAME}/tmp/
rm -f /tmp/jfs_test
./bin/flink run -m yarn-cluster ./examples/batch/WordCount.jar --input jfs://${VOL_NAME}/tmp/jfs_test --output jfs://${VOL_NAME}/tmp/result

Restart relevant services

After the Hadoop SDK has been installed or upgraded, the relevant services need to be restarted for the changes to take effect.

  • Services that don't require a restart:

    • HDFS
    • HUE
    • ZooKeeper
  • Restart these services if they use JuiceFS. If HA has been configured, you can perform a rolling restart to avoid downtime.

    • YARN
    • Hive Metastore Server, HiveServer2
    • Spark ThriftServer
    • Spark Standalone
      • Master
      • Worker
    • Presto
      • Coordinator
      • Worker
    • Impala
      • Catalog Server
      • Daemon
    • HBase
      • Master
      • RegionServer
    • Flume

Best Practices

Hadoop

Hadoop applications typically integrate with HDFS through the org.apache.hadoop.fs.FileSystem class; JuiceFS brings Hadoop support by extending this class.

After the Hadoop SDK has been set up, use hadoop fs to manage JuiceFS:

hadoop fs -ls jfs://{VOL_NAME}/

To drop the jfs:// scheme, set fs.defaultFS to the JuiceFS volume in core-site.xml:

<property>
  <name>fs.defaultFS</name>
  <value>jfs://{VOL_NAME}</value>
</property>
Note

Changing fs.defaultFS essentially makes all paths without an explicit scheme resolve to JuiceFS; this might cause unexpected problems and should be tested thoroughly.
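
As a quick check, once fs.defaultFS points at the JuiceFS volume, scheme-less paths resolve to JuiceFS:

# Lists the JuiceFS volume root, no jfs:// scheme needed
hadoop fs -ls /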

Hive

Use the LOCATION clause to specify data location for Hive database / table.

  • Creating database / table

    CREATE DATABASE ... database_name
    LOCATION 'jfs://{VOL_NAME}/path-to-database';

    CREATE TABLE ... table_name
    LOCATION 'jfs://{VOL_NAME}/path-to-table';

    A Hive table is by default stored under its database location, so if the database is already stored on JuiceFS, all tables within it also reside in JuiceFS.

  • Migrate database / table

    ALTER DATABASE database_name
    SET LOCATION 'jfs://{VOL_NAME}/path-to-database';

    ALTER TABLE table_name
    SET LOCATION 'jfs://{VOL_NAME}/path-to-table';

    ALTER TABLE table_name PARTITION(...)
    SET LOCATION 'jfs://{VOL_NAME}/path-to-partition';

    For Hive, data can be stored on multiple file systems. For an unpartitioned table, all data must be located in one file system; for a partitioned table, you can configure a file system for each individual partition.

    To use JuiceFS as the default storage for Hive, set hive.metastore.warehouse.dir to a JuiceFS directory.

Spark

Spark supports various modes like Standalone / YARN / Kubernetes / Thrift Server.

A JuiceFS cache group needs stable client IPs to function properly, so pay special attention to whether executors have stable IPs under the running mode you use.

When the executor process is fixed (like in Thrift Server mode), or runs on a host with a fixed IP (like Spark on YARN or Standalone), using Distributed Cache is recommended, and juicefs.discover-nodes-url should be set accordingly.

If the executor process is ephemeral and the host IP is constantly changing (like Spark on Kubernetes), using a Dedicated Cache Cluster is recommended. Remember to set juicefs.discover-nodes-url to all.
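
For example, when submitting through spark-submit, these settings can be passed as Hadoop configuration; the cache group name below is only an example.

# Spark on YARN, with distributed cache
spark-submit \
  --conf spark.hadoop.juicefs.cache-group=spark-cache \
  --conf spark.hadoop.juicefs.discover-nodes-url=yarn \
  ...

# Spark on Kubernetes, relying on a dedicated cache cluster
spark-submit \
  --conf spark.hadoop.juicefs.cache-group=spark-cache \
  --conf spark.hadoop.juicefs.discover-nodes-url=all \
  --conf spark.hadoop.juicefs.no-sharing=true \
  ...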

  • Spark shell

    scala> sc.textFile("jfs://{VOL_NAME}/path-to-input").count

HBase

HBase mainly stores two types of data: WAL files and HFiles. Writes are first submitted to the WAL (using hflush) and then enter the RegionServer memstore, so if a RegionServer crashes, data can be recovered from the WAL. After the memstore data is flushed into HFiles on HDFS, the WAL files are deleted so they don't eat up storage space.

Due to hflush implementation differences, when juicefs.hflush is set to sync, JuiceFS cannot perform quite as well as HDFS, so it's recommended to keep WAL files on HDFS and store the final HFiles on JuiceFS.

<property>
  <name>hbase.rootdir</name>
  <value>jfs://{VOL_NAME}/hbase</value>
</property>
<property>
  <name>hbase.wal.dir</name>
  <value>hdfs://{NAME_SPACE}/hbase-wal</value>
</property>

Flink

When using the Streaming File Sink connector, to ensure data consistency, you need to use a RollingPolicy together with checkpointing.

For Hadoop versions before 2.7, HDFS truncate is not available, so the alternative is OnCheckpointRollingPolicy. This policy creates new files on every checkpoint, which can potentially result in a large number of small files.

For Hadoop 2.7 and after, use DefaultRollingPolicy to allow rotating files based on file size, mtime, and free space.

JuiceFS implements truncate, thus DefaultRollingPolicy is supported.

You can integrate Flink with different file systems via plugins: put juicefs-hadoop.jar inside the lib directory.
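
For example (assuming FLINK_HOME is set, matching the installation table earlier):

# Place the SDK in Flink's lib directory
cp juicefs-hadoop.jar ${FLINK_HOME}/lib/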

Flume

JuiceFS integrates with Flume via the HDFS sink. Due to hflush implementation differences (see hflush implementation), pay special attention to hdfs.batchSize, as this is the setting that controls the hflush message batch size.

To ensure data integrity, set juicefs.hflush to sync, and increase hdfs.batchSize to 4MB, which is the default block size that JuiceFS uploads to object storage.

Also, if compression is enabled (hdfs.fileType set to CompressedStream), files are prone to corruption due to missing blocks in the event of an object storage service failure; files previously committed using hflush might still face possible data loss. If this is a concern for you, disable compression by setting hdfs.fileType to DataStream, and run compression in later ETL jobs.

Sqoop

When running data imports into Hive using Sqoop, Sqoop first imports the data into target-dir, which is then loaded into the Hive table using hive load. So when using Sqoop, change target-dir accordingly.

  • 1.4.6

    When using this specific version, change the fs parameter to specify JuiceFS as the default file system. After that, you also need to copy mapreduce.tar.gz from HDFS to JuiceFS into the same corresponding directory; the default path for this file is /hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz.

    sqoop import \
    -fs jfs://{VOL_NAME}/ \
    --target-dir jfs://{VOL_NAME}/path-to-dir
  • 1.4.7

    sqoop import \
    --target-dir jfs://{VOL_NAME}/path-to-dir