Configuration

The Hadoop SDK is one of many ways to use JuiceFS, so most configuration items have the same meaning as their JuiceFS Client counterparts. See Command Reference to learn more.

Core configurations

| Item | Default Value | Description |
| --- | --- | --- |
| `fs.jfs.impl` | `com.juicefs.JuiceFileSystem` | Specifies the implementation of the storage type `jfs://`. |
| `fs.AbstractFileSystem.jfs.impl` | `com.juicefs.JuiceFS` | Specifies the implementation of AbstractFileSystem for MapReduce. |
| `juicefs.token` | | Credential for the JuiceFS volume, available on the settings page of the JuiceFS web console. |
| `juicefs.bucket` | | Optional name or endpoint of the bucket, overriding the value configured in the JuiceFS web console. |
| `juicefs.accesskey` | | Access Key for the object store (omit if the client node can access object storage without credentials). |
| `juicefs.secretkey` | | Secret Key for the object store (omit if the client node can access object storage without credentials). |
| `juicefs.console-url` | | JuiceFS web console address, only needed in on-premise environments. |
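
As a minimal sketch, the core items could be set in core-site.xml as follows (inside the file's `<configuration>` element). The class names are the defaults listed above; `YOUR_VOLUME_TOKEN` is a placeholder, not a real value.

```xml
<!-- core-site.xml: register the jfs:// scheme and supply the volume credential -->
<property>
  <name>fs.jfs.impl</name>
  <value>com.juicefs.JuiceFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.jfs.impl</name>
  <value>com.juicefs.JuiceFS</value>
</property>
<property>
  <name>juicefs.token</name>
  <!-- Placeholder: replace with the token shown in the JuiceFS web console -->
  <value>YOUR_VOLUME_TOKEN</value>
</property>
```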

Data replication configurations

Read Data replication to learn more.

| Item | Default Value | Description |
| --- | --- | --- |
| `juicefs.bucket2` | | Optional name or endpoint of the secondary bucket for Data replication, overriding the value configured in the JuiceFS web console. |
| `juicefs.accesskey2` | | Access Key for the secondary (replica) object store (omit if the client node can access object storage without credentials). |
| `juicefs.secretkey2` | | Secret Key for the secondary (replica) object store (omit if the client node can access object storage without credentials). |

Cache configurations

Read Cache to learn more.

| Item | Default Value | Description |
| --- | --- | --- |
| `juicefs.cache-dir` | `memory` | Local cache directory, defaults to process memory. Multiple directories can be specified, separated by `:`, or with wildcards `*`. When using local directories, create them in advance and give them `0777` permission so that components can share cache data. Same meaning as `--cache-dir`. |
| `juicefs.cache-size` | `100` | Cache capacity in MiB. The default is small because the Hadoop SDK uses memory as the default cache location. Same meaning as `--cache-size`. |
| `juicefs.cache-replica` | `1` | Number of nodes that a block can be scheduled on. Hadoop applications support data locality scheduling by checking the `BlockLocation` attribute of data blocks, so a higher replica count allows blocks to be placed on more nodes, increasing compute task concurrency. Block size is controlled by `juicefs.block.size`. |
| `juicefs.cache-group` | | Cache group name for distributed cache. Nodes within the same group share cache. Disabled by default. Recommended for applications like Spark where perfect data locality isn't available. |
| `juicefs.no-sharing` | `false` | When inside a cache group, only fetch cache data from other nodes and never share this node's own cache. Use this option on ephemeral mount points (like a Kubernetes Pod). |
| `juicefs.cache-full-block` | `true` | Cache full-sized data blocks. Disable this when you frequently access the same set of small files, or when disk throughput is lower than object storage throughput. This option has the opposite meaning of `--cache-partial-only`. |
| `juicefs.memory-size` | `300` | Maximum memory for the read/write buffer in MiB. Same meaning as `--buffer-size`. |
| `juicefs.auto-create-cache-dir` | `true` | Whether to create cache directories automatically. When set to `false`, non-existent cache directories are ignored, effectively disabling cache. |
| `juicefs.free-space` | `0.2` | Minimum free space ratio. When free space falls below this ratio, cache is cleared to free disk space. Defaults to 20%. Same meaning as `--free-space-ratio`. |
| `juicefs.metacache` | `true` | Enable metadata cache. |
| `juicefs.discover-nodes-url` | | Node discovery API. The node list is refreshed every 10 minutes. The node list also serves as a whitelist for the cache group: only nodes in this list can join it. Use this to prevent clients outside the computing cluster from joining the cache group and degrading distributed cache performance (read cache group troubleshooting for more). Supported values are listed below this table. |
| `juicefs.hflush-delay` | `0` | Delay `hflush` operations (in ms) so that data writes are consolidated, which results in fewer object storage PUT requests and higher overall throughput. Typically used to improve HBase WAL performance. |
| `juicefs.write-group-cache` | `false` | Build distributed cache for newly written blocks. Same meaning as `--fill-group-cache`. |
| `juicefs.entry-cache` | `0.0` | File entry cache timeout in seconds. |
| `juicefs.dir-entry-cache` | `0.0` | Directory entry cache timeout in seconds. |
| `juicefs.attr-cache` | `0.0` | File attribute cache timeout in seconds. |
| `juicefs.block.size` | `dfs.blocksize` or 128MB | Logical block size for the Hadoop SDK, which controls task data sizes for applications like Spark. |
| `juicefs.cache-group-size` | 4 * `juicefs.block.size` | JuiceFS Client performs readahead and prefetch, so for files smaller than this size, the client tries to schedule all of their data blocks onto a single node to maximize cache utilization. |

Supported values for juicefs.discover-nodes-url:

  • All nodes: all. This mode disables auto discovery, so data locality scheduling isn't available because there is no way to generate BlockLocation.
  • YARN: yarn
  • Spark Standalone: http://spark-master:web-ui-port/json/
  • Spark ThriftServer: http://thrift-server:4040/api/v1/applications/
  • Presto: http://coordinator:discovery-uri-port/v1/service/presto/
  • File system: jfs://{VOLUME}/etc/nodes. You need to create this file manually and write node hostnames into it, one per line.

For Kerberos clusters, only the "All nodes" and "File system" configurations are supported.
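
The following sketch shows how several cache items might be combined for a YARN cluster with local disks. The directory paths, cache size, and group name are hypothetical values chosen for illustration; only the item names and the `yarn` discovery value come from the descriptions above.

```xml
<!-- core-site.xml: disk cache shared within a cache group (illustrative values) -->
<property>
  <name>juicefs.cache-dir</name>
  <!-- Hypothetical paths; create them in advance with 0777 permission -->
  <value>/data1/jfscache:/data2/jfscache</value>
</property>
<property>
  <name>juicefs.cache-size</name>
  <!-- In MiB; 102400 MiB = 100 GiB, an assumed capacity -->
  <value>102400</value>
</property>
<property>
  <name>juicefs.cache-group</name>
  <!-- Hypothetical group name -->
  <value>yarn-cluster</value>
</property>
<property>
  <name>juicefs.discover-nodes-url</name>
  <!-- Discover cache group members through YARN -->
  <value>yarn</value>
</property>
```

With the "File system" discovery mode instead, the nodes file would simply contain one hostname per line.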

Object storage configurations

| Item | Default Value | Description |
| --- | --- | --- |
| `juicefs.bucket` | | Bucket name of the object store. |
| `juicefs.prefetch` | `1` | Prefetch N blocks in parallel. Same as `--prefetch`. |
| `juicefs.max-uploads` | `50` | Maximum number of concurrent object uploads. |
| `juicefs.upload-limit` | `0` | Upload speed limit for a single process, in byte/s. |
| `juicefs.max-downloads` | `50` | Maximum number of concurrent object downloads. |
| `juicefs.download-limit` | `0` | Download speed limit for a single process, in byte/s. |
| `juicefs.get-timeout` | `5` | Maximum number of seconds to download an object. |
| `juicefs.put-timeout` | `60` | Maximum number of seconds to upload an object. |
| `juicefs.max-readahead` | `0` | Maximum memory size in MiB for readahead (see the relevant sections in Cache to learn about readahead). Defaults to 0, which means the actual maximum readahead is 20% of `juicefs.memory-size`. Set this to a small integer (like 1) to reduce read amplification. |
| `juicefs.external` | `false` | Use the external domain to access the object store. |
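
For example, a workload dominated by small random reads might lower readahead and cap upload bandwidth. This is only a sketch; the specific numbers are assumptions, not recommendations.

```xml
<!-- core-site.xml: reduce read amplification and throttle uploads (illustrative values) -->
<property>
  <name>juicefs.max-readahead</name>
  <!-- 1 MiB instead of the default 20% of juicefs.memory-size -->
  <value>1</value>
</property>
<property>
  <name>juicefs.upload-limit</name>
  <!-- Unit is byte/s: 104857600 = 100 * 1024 * 1024, i.e. 100 MiB/s per process -->
  <value>104857600</value>
</property>
```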

Security configurations

| Item | Default Value | Description |
| --- | --- | --- |
| `juicefs.server-principal` | | After enabling Kerberos, specify the principal of the JuiceFS metadata service. Refer to "Using Kerberos". |

Other configurations

| Item | Default Value | Description |
| --- | --- | --- |
| `juicefs.access-log` | | File path of the file system access log (e.g. `/tmp/juicefs.access.log`). Read and write permission is required for all Hadoop components that use JuiceFS. The log file rotates at 300MiB, and the last 7 files are retained. |
| `juicefs.debug` | `false` | Enable DEBUG level logging. |
| `juicefs.superuser` | `hdfs` | Superuser name, which tells the JuiceFS Hadoop SDK which user is the superuser. |
| `juicefs.supergroup` | `hdfs` | Supergroup name. All users within this group are considered superusers. |
| `juicefs.rsaPrivKeyPath` | | File path of the RSA private key for data encryption. |
| `juicefs.rsaPassphrase` | | Passphrase of the RSA key for data encryption. |
| `juicefs.file.checksum` | `false` | Enable checksum for copying data via Hadoop DistCp. |
| `juicefs.grouping` | | Location of the group file that configures user groups and user mapping information, e.g. `jfs://myjfs/etc/group`. The file format is `<groupname>:<username1>,<username2>`. |
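
To illustrate the group file, the sketch below points juicefs.grouping at the example path given above and shows, in a comment, what such a file could contain. The group and user names are hypothetical.

```xml
<!-- core-site.xml: load group mappings from a file stored in the volume -->
<property>
  <name>juicefs.grouping</name>
  <value>jfs://myjfs/etc/group</value>
</property>
<!--
  Possible contents of jfs://myjfs/etc/group, one group per line
  (hypothetical names, following the groupname:username1,username2 format):

  analysts:alice,bob
  etl:carol
-->
```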

Configure multiple JuiceFS file systems

When using multiple JuiceFS volumes, any of the above items can be set for a single file system by placing the volume name VOL_NAME in the middle of the configuration item, for example:

core-site.xml
<property>
  <name>juicefs.{VOL_NAME}.debug</name>
  <value>false</value>
</property>
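
The same pattern extends to credentials. As a sketch, two volumes with the hypothetical names `vol1` and `vol2` could each receive their own token (the token values are placeholders):

```xml
<!-- core-site.xml: per-volume credentials for two hypothetical volumes -->
<property>
  <name>juicefs.vol1.token</name>
  <value>TOKEN_FOR_VOL1</value>
</property>
<property>
  <name>juicefs.vol2.token</name>
  <value>TOKEN_FOR_VOL2</value>
</property>
```

Each volume would then be addressed by its own URI, such as `jfs://vol1/` and `jfs://vol2/`.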