JuiceFS CSI Workflow: K8s Pod Creation with PVs

2024-09-30
Arthur

This article is translated from the blog post published on ArthurChiao's Blog.

JuiceFS is a distributed file system built on top of object storage systems (like S3, Ceph, and OSS). Simply put:

  • Object storage: Can only be accessed through a flat key-value interface (put and get an object by its key).
  • File system: Provides a familiar directory structure that allows operations like ls, cat, find, and truncate.
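
To make the difference concrete, here is a minimal Go sketch (illustrative only, not JuiceFS code): an object store is just a flat key-to-bytes mapping, while a file system adds hierarchy and POSIX operations on top.

package main

import (
	"fmt"
	"os"
)

func main() {
	// Object storage: a flat key -> bytes namespace. "Directories"
	// are only key prefixes; the verbs are put/get/delete by key.
	store := map[string][]byte{}
	store["photos/2024/cat.jpg"] = []byte("...")
	fmt.Println(len(store["photos/2024/cat.jpg"]))

	// File system: a real hierarchy plus POSIX operations such as
	// listing a directory (ls) and truncating a file in place.
	entries, _ := os.ReadDir("/tmp")
	fmt.Println(len(entries))
	_ = os.Truncate("/tmp/example.txt", 0) // fails harmlessly if the file is absent
}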

This article provides a high-level overview of what happens in the JuiceFS CSI solution when creating a pod with a Persistent Volume (PV) and how the pod reads and writes to the PV. It aims to facilitate a quick understanding of the K8s CSI mechanism and the basic working principles of JuiceFS.

JuiceFS CSI workflow

Background knowledge

K8s CSI

The Container Storage Interface (CSI) is a standard for exposing arbitrary block and file storage systems to containerized workloads on Container Orchestration Systems (COs) like Kubernetes (K8s).

Kubernetes CSI Developer Documentation

CSI is a storage mechanism supported by K8s that offers excellent extensibility. Any storage solution can integrate with K8s by implementing certain interfaces according to the specification.

Generally, a storage solution needs to deploy a service known as a "CSI plugin" on each node. The kubelet calls this plugin during the creation of pods with PVs. Note that:

  • K8s' CNI plugins are executable files placed in /opt/cni/bin/, which the kubelet directly runs when creating pod networks.
  • K8s' CSI plugins are services (which might be better understood as agents), and the kubelet communicates with them via gRPC during PV initialization.

FUSE

Filesystem in Userspace (FUSE) is a framework that lets file systems run in user space, making it easy for users to develop their own file systems. Here’s a basic overview using lxcfs (which is unrelated to JuiceFS but is also a FUSE file system) to illustrate FUSE's working principle:

lxcfs/fuse workflow: how a read operation is handled

JuiceFS implements its user-space file system on top of FUSE. Here’s a brief overview from the community documentation: traditionally, implementing a FUSE file system requires Linux libfuse, which provides two types of APIs:

  • High-level APIs: Based on file names and paths. libfuse internally maintains a virtual file system (VFS) tree and exposes path-based APIs. This is suitable for systems whose metadata is provided via path-based APIs, like HDFS or S3. If the metadata is organized as an inode directory tree, the conversion from inode to path and back to inode can impact performance.
  • Low-level APIs: Based on inodes. The kernel's VFS interacts with the FUSE library using low-level APIs.

JuiceFS organizes its metadata by inode, so it uses the low-level API, relying on go-fuse rather than libfuse; go-fuse is simpler and performs better for this case.
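
To give a feel for the inode-based style, here is a minimal go-fuse program (illustrative, not JuiceFS code) that serves an existing directory through FUSE; the backing and mount paths are placeholders:

package main

import (
	"log"

	"github.com/hanwen/go-fuse/v2/fs"
)

func main() {
	// Serve the contents of an existing directory over FUSE. A real
	// client like JuiceFS would instead resolve each inode against
	// its metadata engine and fetch data from object storage.
	root, err := fs.NewLoopbackRoot("/tmp/backing")
	if err != nil {
		log.Fatal(err)
	}
	// Mount at /tmp/mnt: the kernel FUSE module forwards VFS calls
	// (lookup, read, write, ...) for this path to this process.
	server, err := fs.Mount("/tmp/mnt", root, &fs.Options{})
	if err != nil {
		log.Fatal(err)
	}
	server.Wait() // serve until unmounted (fusermount -u /tmp/mnt)
}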

JuiceFS modes of operation

JuiceFS has several deployment modes:

Process mount mode: The JuiceFS client runs in the CSI Node plugin container, mounting all required JuiceFS PVs within this container in process mode.

CSI mode, which can be further divided into two types:

  • Mount Pod mode: The CSI plugin dynamically creates a caretaker pod for each PV used by local pods on each node. This Mount Pod is per PV rather than per business pod. This means if multiple business pods use the same PV, only one Mount Pod will be created, as shown in the figure below.
Workflow for creating a business pod in Mount Pod mode
  • The Mount Pod contains the JuiceFS client, handling all read and write operations related to JuiceFS for the business pod. For clarity, this article will refer to the Mount Pod as a dynamic client pod or client pod.
  • This is the default working mode of JuiceFS CSI.
  • The Mount Pod needs privileged permissions to run FUSE.
  • If the client pod restarts, it may temporarily prevent the business pod from reading and writing until the client pod is back up.

CSI sidecar mode: A sidecar container is created for each business pod using JuiceFS PV.

  • The sidecar is per-pod level.
  • The sidecar is not created by the JuiceFS plugin itself. The CSI Controller registers a webhook that intercepts pod creation, automatically injecting a sidecar container into the pod spec. This is similar to how Istio injects the Envoy container.
  • If the sidecar restarts, the business pod must be rebuilt to recover. The sidecar also requires privileged permissions for FUSE, which poses risks since each sidecar can see all devices on the node, so this mode is not recommended.

Next, let’s look at what happens when creating a business pod that requires a mounted PV in K8s and the actions of K8s and JuiceFS components.

What happens when creating a pod using a PV

Workflow when a business pod is created (JuiceFS Mount Pod mode):

Workflow for creating a business pod

Step 1: The kubelet starts and listens for changes in pod resources across the cluster.

As the K8s agent on each node, the kubelet listens for changes to pod resources in the K8s cluster after startup. Specifically, it receives events for pod creation, updates, and deletions from the kube-apiserver.

Step 2: The kubelet receives a business pod creation event and begins creating the pod.

Upon receiving a pod creation event, the kubelet first checks if the pod is under its management (whether the nodeName in spec matches this node). If yes, it starts creating the pod.

1. Create a business pod

kubelet.INFO contains detailed information:

10:05:57.410  Receiving a new pod "pod1(<pod1-id>)"
10:05:57.411  SyncLoop (ADD, "api"): "pod1(<pod1-id>)"
10:05:57.411  Needs to allocate 2 "nvidia.com/gpu" for pod "<pod1-id>" container "container1"
10:05:57.411  Needs to allocate 1 "our-corp.com/ip" for pod "<pod1-id>" container "container1"
10:05:57.413  Cgroup has some missing paths: [/sys/fs/cgroup/pids/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/systemd/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/cpuset/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/memory/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/hugetlb/kubepods/burstable/pod<pod1-id>]
10:05:57.413  Cgroup has some missing paths: [/sys/fs/cgroup/memory/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/systemd/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/hugetlb/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/pids/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/cpuset/kubepods/burstable/pod<pod1-id>]
10:05:57.413  Cgroup has some missing paths: [/sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/pids/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/cpuset/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/systemd/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/memory/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/hugetlb/kubepods/burstable/pod<pod1-id>]
10:05:57.415  Using factory "raw" for container "/kubepods/burstable/pod<pod1-id>"
10:05:57.415  Added container: "/kubepods/burstable/pod<pod1-id>" (aliases: [], namespace: "")
10:05:57.419  Waiting for volumes to attach and mount for pod "pod1(<pod1-id>)"

10:05:57.432  SyncLoop (RECONCILE, "api"): "pod1(<pod1-id>)"

10:05:57.471  Added volume "meminfo" (volSpec="meminfo") for pod "<pod1-id>" to desired state.
10:05:57.471  Added volume "cpuinfo" (volSpec="cpuinfo") for pod "<pod1-id>" to desired state.
10:05:57.471  Added volume "stat" (volSpec="stat") for pod "<pod1-id>" to desired state.
10:05:57.480  Added volume "share-dir" (volSpec="pvc-6ee43741-29b1-4aa0-98d3-5413764d36b1") for pod "<pod1-id>" to desired state.
10:05:57.484  Added volume "data-dir" (volSpec="juicefs-volume1-pv") for pod "<pod1-id>" to desired state.
...

You can see that the required resources for the pod are processed sequentially:

  • Devices, such as GPUs
  • IP addresses
  • cgroup resource isolation configurations
  • Volumes

This article focuses on volume resources.

2. Handle the volumes required by the pod

From the logs, you can see that the business pod declares several volumes, including:

  • HostPath type: Directly mounts a node path into the container.
  • LXCFS type: For fixing the container resource view (e.g., making /proc/meminfo and /proc/cpuinfo inside the container reflect container limits rather than host resources).
  • Dynamic/static PV types: The JuiceFS volume falls into this category.

The kubelet logs:

# kubelet.INFO
10:05:57.509  operationExecutor.VerifyControllerAttachedVolume started for volume "xxx"
10:05:57.611  Starting operationExecutor.MountVolume for volume "xxx" (UniqueName: "kubernetes.io/host-path/<pod1-id>-xxx") pod "pod1" (UID: "<pod1-id>") 
10:05:57.611  operationExecutor.MountVolume started for volume "juicefs-volume1-pv" (UniqueName: "kubernetes.io/csi/csi.juicefs.com^juicefs-volume1-pv") pod "pod1" (UID: "<pod1-id>") 
10:05:57.611  kubernetes.io/csi: mounter.GetPath generated [/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount]
10:05:57.611  kubernetes.io/csi: created path successfully [/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv]
10:05:57.611  kubernetes.io/csi: saving volume data file [/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/vol_data.json]
10:05:57.611  kubernetes.io/csi: volume data file saved successfully [/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/vol_data.json]
10:05:57.613  MountVolume.MountDevice succeeded for volume "juicefs-volume1-pv" (UniqueName: "kubernetes.io/csi/csi.juicefs.com^juicefs-volume1-pv") pod "pod1" (UID: "<pod1-id>") device mount path "/var/lib/k8s/kubelet/plugins/kubernetes.io/csi/pv/juicefs-volume1-pv/globalmount"
10:05:57.616  kubernetes.io/csi: mounter.GetPath generated [/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount]
10:05:57.616  kubernetes.io/csi: Mounter.SetUpAt(/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount)
10:05:57.616  kubernetes.io/csi: created target path successfully [/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount]
10:05:57.618  kubernetes.io/csi: calling NodePublishVolume rpc [volid=juicefs-volume1-pv,target_path=/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount]
10:05:57.713  Starting operationExecutor.MountVolume for volume "juicefs-volume1-pv" (UniqueName: "kubernetes.io/csi/csi.juicefs.com^juicefs-volume1-pv") pod "pod1" (UID: "<pod1-id>") 
...
10:05:59.506  kubernetes.io/csi: mounter.SetUp successfully requested NodePublish [/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount]
10:05:59.506  MountVolume.SetUp succeeded for volume "juicefs-volume1-pv" (UniqueName: "kubernetes.io/csi/csi.juicefs.com^juicefs-volume1-pv") pod "pod1" (UID: "<pod1-id>") 
10:05:59.506  kubernetes.io/csi: mounter.GetPath generated [/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount]

For each volume, it executes these operations:

  1. The operationExecutor.VerifyControllerAttachedVolume() method is called to perform some checks.
  2. The operationExecutor.MountVolume() method mounts the specified volume to the container's directory.
  3. For CSI storage, the NodePublishVolume() method of the CSI plugin is also called to initialize the corresponding PV, which is the case for JuiceFS.

The kubelet will continuously check whether all volumes are properly mounted; if they are not, it will not proceed to the next step (creating the sandbox container).

Step 3: kubelet --> CSI plugin (JuiceFS): Setup PV

Let’s take a closer look at the logic for initializing PV mounts in the node CSI plugin. The call stack is as follows:

         gRPC NodePublishVolume()
kubelet ---------------------------> juicefs node plugin (also called "driver", etc)
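
In code terms, the kubelet's side of this call is ordinary gRPC over the plugin's Unix socket. Below is a hedged sketch using the CSI spec's generated Go bindings; the socket path and capability settings are illustrative, not taken from the kubelet source:

package main

import (
	"context"
	"log"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// CSI node plugins listen on a Unix socket registered with the kubelet
	// (path is illustrative).
	conn, err := grpc.Dial("unix:///var/lib/kubelet/plugins/csi.juicefs.com/csi.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// NodePublishVolume asks the plugin to make the volume available
	// at the pod-specific target path, as seen in the kubelet logs.
	_, err = csi.NewNodeClient(conn).NodePublishVolume(context.Background(),
		&csi.NodePublishVolumeRequest{
			VolumeId: "juicefs-volume1-pv",
			// <pod1-id> is a placeholder, as in the logs above.
			TargetPath: "/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount",
			VolumeCapability: &csi.VolumeCapability{
				AccessType: &csi.VolumeCapability_Mount{Mount: &csi.VolumeCapability_MountVolume{}},
				AccessMode: &csi.VolumeCapability_AccessMode{
					Mode: csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER,
				},
			},
		})
	if err != nil {
		log.Fatal(err)
	}
}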

Step 4: What the JuiceFS CSI plugin does

Let’s check the logs of the JuiceFS CSI node plugin directly on the machine:

(node) $ docker logs --timestamps k8s_juicefs-plugin_juicefs-csi-node-xxx | grep juicefs-volume1
10:05:57.619 NodePublishVolume: volume_id is juicefs-volume1-pv

10:05:57.619 NodePublishVolume: creating dir /var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount

10:05:57.620 ceFormat cmd: [/usr/local/bin/juicefs format --storage=OSS --bucket=xx --access-key=xx --secret-key=${secretkey} --token=${token} ${metaurl} juicefs-volume1]
10:05:57.874 Format output is juicefs <INFO>: Meta address: tikv://node1:2379,node2:2379,node3:2379/juicefs-volume1
10:05:57.874 cefs[1983] <INFO>: Data use oss://<bucket>/juicefs-volume1/

10:05:57.875 Mount: mounting "tikv://node1:2379,node2:2379,node3:2379/juicefs-volume1" at "/jfs/juicefs-volume1-pv" with options [token=xx]

10:05:57.884 createOrAddRef: Need to create pod juicefs-node1-juicefs-volume1-pv.
10:05:57.891 createOrAddRed: GetMountPodPVC juicefs-volume1-pv, err: %!s(<nil>)
10:05:57.891 ceMount: mount tikv://node1:2379,node2:2379,node3:2379/juicefs-volume1 at /jfs/juicefs-volume1-pv
10:05:57.978 createOrUpdateSecret: juicefs-node1-juicefs-volume1-pv-secret, juicefs-system
10:05:59.500 waitUtilPodReady: Pod juicefs-node1-juicefs-volume1-pv is successful

10:05:59.500 NodePublishVolume: binding /jfs/juicefs-volume1-pv at /var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount with options []
10:05:59.505 NodePublishVolume: mounted juicefs-volume1-pv at /var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount with options []

It’s clear that the NodePublishVolume() method is executed. This method is uniquely implemented by each CSI plugin, so its functionality is closely tied to the storage solution. Let’s examine what the JuiceFS plugin does in detail.

1. Create mount path for pod PV and initialize the volume

By default, each pod corresponds to a storage path on the node:

(node) $ ll /var/lib/k8s/kubelet/pods/<pod-id>
containers/
etc-hosts
plugins/
volumes/

The JuiceFS plugin creates a corresponding subdirectory and mount point within the volumes/ directory:

/var/lib/k8s/kubelet/pods/{pod1-id}/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount

It then formats the volume using the JuiceFS command-line tool:

$ /usr/local/bin/juicefs format --storage=OSS --bucket=xx --access-key=xx --secret-key=${secretkey} --token=${token} ${metaurl} juicefs-volume1
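
A simplified sketch of these two steps (per-pod directory creation plus the format call), mirroring the logged command; this is not the driver's actual code, and credentials are elided:

package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
)

// setupVolume mirrors the logged steps: create the per-pod mount
// point, then run "juicefs format", which initializes the volume's
// metadata in the Meta Server if it hasn't been done already.
func setupVolume(targetPath, metaURL, volName string) error {
	if err := os.MkdirAll(targetPath, 0o750); err != nil {
		return err
	}
	out, err := exec.Command("/usr/local/bin/juicefs", "format",
		"--storage=OSS", "--bucket=xx", metaURL, volName).CombinedOutput()
	if err != nil {
		return fmt.Errorf("juicefs format: %v: %s", err, out)
	}
	return nil
}

func main() {
	// <pod1-id> is a placeholder, as in the logs above.
	err := setupVolume(
		"/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount",
		"tikv://node1:2379,node2:2379,node3:2379/juicefs-volume1",
		"juicefs-volume1")
	if err != nil {
		log.Fatal(err)
	}
}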

2. Write volume mount information to Meta Server

The mount information is synchronized with JuiceFS' Meta Server, which is TiKV in this example:

Writing volume mount information to Meta Server

3. JuiceFS plugin: Create a client pod if it doesn’t exist

The JuiceFS CSI plugin checks whether a client pod exists on the node for the PV. If it doesn't, the plugin creates one; if it does, no further action is required.

When the last business pod using a PV is destroyed, the corresponding client pod is automatically removed by the JuiceFS CSI plugin.

In our environment, we use a dynamic client pod approach, so we see the following logs:

(node) $ docker logs --timestamps <csi plugin container> | grep 
...
10:05:57.884 createOrAddRef: Need to create pod juicefs-node1-juicefs-volume1-pv.
10:05:57.891 createOrAddRed: GetMountPodPVC juicefs-volume1-pv, err: %!s(<nil>)
10:05:57.891 ceMount: mount tikv://node1:2379,node2:2379,node3:2379/juicefs-volume1 at /jfs/juicefs-volume1-pv
10:05:57.978 createOrUpdateSecret: juicefs-node1-juicefs-volume1-pv-secret, juicefs-system
10:05:59.500 waitUtilPodReady:

The JuiceFS node plugin creates a dynamic client pod named juicefs-{node}-{volume}-pv in K8s.
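
A hedged client-go sketch of this "create if missing" logic follows; the namespace, image, and pod spec are illustrative, and the real plugin additionally wires in mount options, secrets, cache volumes, and reference bookkeeping:

package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// ensureMountPod creates the per-(node, PV) client pod if it is absent.
func ensureMountPod(ctx context.Context, c kubernetes.Interface, nodeName, pvName, metaURL string) error {
	name := fmt.Sprintf("juicefs-%s-%s", nodeName, pvName) // juicefs-{node}-{volume}-pv
	_, err := c.CoreV1().Pods("juicefs-system").Get(ctx, name, metav1.GetOptions{})
	if err == nil || !apierrors.IsNotFound(err) {
		return err // pod already exists (nothing to do) or a real error
	}
	priv := true
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "juicefs-system"},
		Spec: corev1.PodSpec{
			NodeName: nodeName, // pin the client pod to the node that needs the mount
			Containers: []corev1.Container{{
				Name:    "jfs-mount",
				Image:   "juicedata/mount:latest", // illustrative image tag
				Command: []string{"/bin/mount.juicefs", metaURL, "/jfs/" + pvName},
				// FUSE needs /dev/fuse, hence privileged (see the caveats above).
				SecurityContext: &corev1.SecurityContext{Privileged: &priv},
			}},
		},
	}
	_, err = c.CoreV1().Pods("juicefs-system").Create(ctx, pod, metav1.CreateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes this runs inside the CSI node plugin pod
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	err = ensureMountPod(context.Background(), client, "node1", "juicefs-volume1-pv",
		"tikv://node1:2379,node2:2379,node3:2379/juicefs-volume1")
	if err != nil {
		log.Fatal(err)
	}
}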

The JuiceFS node plugin creates a client pod

Step 5: The kubelet listens for client pod creation events

At this point, the business pod hasn’t been created yet, but its corresponding JuiceFS client pod is already being created:

(node) $ grep juicefs-<node>-<volume>-pv /var/log/kubernetes/kubelet.INFO | grep "received "
10:05:58.288 SyncPod received new pod "juicefs-node1-volume1-pv_juicefs-system", will create a sandbox for it

The process for creating the JuiceFS dynamic client pod follows next. Without the client pod, the business pod cannot read or write to the JuiceFS volume even if it’s running.

Step 6: The kubelet creates a client pod

The process for creating the client pod is similar to that of the business pod, but simpler. We’ll skip the details and assume it starts successfully.

To check the processes running in this client pod:

(node) $ dk top k8s_jfs-mount_juicefs-node1-juicefs-volume1-pv-xx
/bin/mount.juicefs ${metaurl} /jfs/juicefs-volume1-pv -o enable-xattr,no-bgjob,allow_other,token=xxx,metrics=0.0.0.0:9567

/bin/mount.juicefs is merely a symlink pointing to the juicefs executable:

(pod) $ ls -ahl /bin/mount.juicefs
/bin/mount.juicefs -> /usr/local/bin/juicefs

Step 7: Client pod initialization and FUSE mount

Let’s see what this client pod does:

root@node:~  # dk top k8s_jfs-mount_juicefs-node1-juicefs-volume1-pv-xx
<INFO>: Meta address: tikv://node1:2379,node2:2379,node3:2379/juicefs-volume1
<INFO>: Data use oss://<oss-bucket>/juicefs-volume1/
<INFO>: Disk cache (/var/jfsCache/<id>/): capacity (10240 MB), free ratio (10%), max pending pages (15)
<INFO>: Create session 667 OK with version: admin-1.2.1+2022-12-22.34c7e973
<INFO>: listen on 0.0.0.0:9567
<INFO>: Mounting volume juicefs-volume1 at /jfs/juicefs-volume1-pv ...
<INFO>: OK, juicefs-volume1 is ready at /jfs/juicefs-volume1-pv

The client pod initializes the local volume configuration, interacts with the Meta Server, and exposes Prometheus metrics. Using JuiceFS' own mount implementation (the /bin/mount.juicefs mentioned earlier), it mounts the volume at /jfs/juicefs-volume1-pv, which typically corresponds to /var/lib/juicefs/volume/juicefs-volume1-pv on the node.

At this point, you can observe the following mount information on the node:

(node) $ cat /proc/mounts | grep JuiceFS:juicefs-volume1
JuiceFS:juicefs-volume1 /var/lib/juicefs/volume/juicefs-volume1-pv fuse.juicefs rw,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
JuiceFS:juicefs-volume1 /var/lib/k8s/kubelet/pods/<pod-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount fuse.juicefs rw,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0

This shows that the volume is mounted with the fuse.juicefs filesystem type.

After this dynamic client pod is created, any read or write operation the business pod (which still doesn’t exist at this point) performs on the volume will go through the kernel FUSE module and be forwarded to the user-space JuiceFS client for processing. The JuiceFS client implements the corresponding read and write methods for different object stores.
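
The "different object stores" point usually boils down to a small storage interface behind the FUSE handlers. The sketch below shows the general shape (illustrative; not JuiceFS's exact interface) that each backend such as S3, OSS, or Ceph would implement:

package object

import "io"

// ObjectStorage is the kind of narrow interface a FUSE client programs
// against: flat keys, ranged reads, and whole-object writes. One
// implementation per backend (S3, OSS, Ceph, ...) plugs in behind it.
type ObjectStorage interface {
	// Get reads limit bytes starting at off; ranged reads let the
	// client fetch just the part of an object that a read touches.
	Get(key string, off, limit int64) (io.ReadCloser, error)
	Put(key string, in io.Reader) error
	Delete(key string) error
}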

Step 8: The kubelet creates a business pod

Now all volumes required by the pod have been processed, and the kubelet logs:

# kubelet.INFO
10:06:06.119  All volumes are attached and mounted for pod "pod1(<pod1-id>)"

It can then proceed to create the business pod:

# kubelet.INFO
10:06:06.119  No sandbox for pod "pod1(<pod1-id>)" can be found. Need to start a new one
10:06:06.119  Creating PodSandbox for pod "pod1(<pod1-id>)"
10:06:06.849  Created PodSandbox "885c3a" for pod "pod1(<pod1-id>)"
...

How the business pod reads and writes to JuiceFS volume

The JuiceFS dynamic client pod is created before the business pod, allowing the business pod to read and write to the JuiceFS PV (volume) immediately upon creation.

Workflow of a business pod reading/writing

This process can be divided into four steps.

Step 1: The pod reads/writes to files

For example, within the pod, navigate to the volume path (for example, cd /data/juicefs-pv-dir/) and perform operations like ls and find.

Step 2: R/W requests hooked by FUSE module and forwarded to the JuiceFS client

Here are two diagrams for illustration.

Read operations:

JuiceFS read operations

Write operations:

JuiceFS write operations

Step 3: The JuiceFS client pod reads metadata from Meta Server

The figures above reveal some of JuiceFS' metadata design, such as chunks, slices, and blocks. During read and write operations, the client interacts with the Meta Server to query and update this metadata.
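
To make that concrete: JuiceFS's public documentation describes files as split into 64 MiB chunks, written as slices, and stored as blocks of up to 4 MiB in the object store. Here is an illustrative sketch of the offset arithmetic a read implies (sizes from the docs; the function itself is hypothetical, not JuiceFS code):

package main

import "fmt"

const (
	chunkSize = 64 << 20 // 64 MiB logical chunks (per JuiceFS docs)
	blockSize = 4 << 20  // up to 4 MiB per object-store block
)

// locate maps a file offset to the chunk and block a read touches;
// the client asks the Meta Server which slice/block IDs cover them.
func locate(offset int64) (chunk, block, blockOff int64) {
	chunk = offset / chunkSize
	inChunk := offset % chunkSize
	return chunk, inChunk / blockSize, inChunk % blockSize
}

func main() {
	c, b, off := locate(100 << 20) // byte 100 MiB into a file
	fmt.Printf("chunk %d, block %d, offset %d\n", c, b, off)
}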

Step 4: The JuiceFS client pod reads/writes to files from object store

In this step, files are read from and written to an object store like S3.

Summary

This article outlined the process of creating a pod with a PV using JuiceFS as a K8s CSI plugin and how the pod interacts with the PV. Many details have been omitted due to space constraints. If you have any questions about this article, feel free to join JuiceFS discussions on GitHub and the community on Slack.
