Author: Dongdong Lv, Architect of Unisound HPC Platform
He is responsible for the architecture design and development of the large-scale distributed machine learning platform, as well as application optimization of deep learning algorithms and acceleration of AI model training. His research areas include large-scale cluster scheduling, high-performance computing, distributed file storage, and distributed caching. He is also a fan of the open source community, especially cloud native projects.
Unisound is an AI company focused on speech and natural language processing technology. Atlas, Unisound’s internal HPC platform, provides Unisound with basic computing capabilities such as training acceleration for model iterations in various AI domains (e.g. speech, natural language processing, and vision). The deep learning computing power of the Atlas platform exceeds 57 PFLOPS; computing power is the core metric for measuring the performance of an AI platform.
In this article, we will share the storage construction history of Atlas and the practice of building efficient storage based on JuiceFS.
An HPC platform requires not only sufficient computing power but also an efficient storage system. Based on the characteristics and types of tasks on Atlas, an efficient storage system should store multiple types of structured and unstructured data, be compatible with POSIX, and perform well in scenarios with massive numbers of small files.
In the earliest days of building Atlas, we tried to deploy CephFS. The open source version of CephFS began to show serious performance problems once the storage scale reached tens of millions of small files. Users would encounter lag when operating on files, and the whole storage system could even freeze under high I/O load.
Later, we moved to Lustre, a parallel file system that is more widely used in the HPC field, and built several Lustre clusters of different sizes as the core storage system of the platform. The production environment mainly has 40G Ethernet and 100G InfiniBand clusters. Lustre supports a series of scenarios in the Atlas HPC cluster, such as data processing, model training, source code compilation and debugging, and data archiving.
However, because Lustre hits a performance bottleneck under highly concurrent requests, it cannot serve scenarios that demand high bandwidth and IOPS. We therefore use Alluxio + Fluid for I/O acceleration; the distributed cache speeds up AI model training and reduces the total bandwidth consumed on the storage system.
But the above solution is still not the final one, so we are also exploring a new storage system. Our core requirements for this new storage system are as follows:
- Easy Operation & Maintenance: it should be simple enough for the relevant staff to get started quickly, expand capacity, and troubleshoot. Lustre provides IML, a set of automated deployment and monitoring tools, which is convenient for operations and maintenance. However, since Lustre's code runs in the Linux kernel, problems are less intuitive to locate: they have to be traced through kernel messages, and most fixes involve restarting the operating system.
- Data reliability: data is a valuable asset for AI companies, and data that algorithm engineers may use should be sufficiently stable and secure. Lustre currently does not support file system-level redundancy (FLR), and can only resist hard drive failure through hardware RAID.
- Multi-level caching within Client: In a large-scale data storage system (petabytes or more), most of the data is stored on HDDs for cost reasons. To automatically distinguish hot data from cold data, and to make use of the near-terabyte memory and the many independent SSDs in our GPU servers, we want multi-level automatic caching within the client to cope with highly I/O-intensive read and write scenarios.
- Community activeness: an active community responds faster with feature iterations and bug fixes, so this is also a factor we consider.
Getting to know JuiceFS
In early 2021, we learned about JuiceFS, contacted Juicedata (the company behind JuiceFS), and conducted a PoC. JuiceFS is now live in our production environment, and we have joined the JuiceFS open source community as well.
JuiceFS architecture and its advantages
JuiceFS consists of a metadata engine, an object storage, and a JuiceFS client, where the metadata engine and object storage provide multiple options for users to choose from. The data stored by JuiceFS will be persisted to object storage (e.g. Amazon S3), and the corresponding metadata can be persisted to various database engines such as Redis, MySQL, SQLite and TiKV depending on the scenario and requirements.
Both the metadata engine and the object storage have many proven solutions, and on the public cloud there are usually fully managed services available out of the box. JuiceFS' automatic metadata backup, recycle bin and other features guarantee data reliability to a certain extent and avoid data loss in some unexpected situations. However, if you operate and maintain your own metadata engine and object storage, you still need to back up your data. JuiceFS's local caching feature automatically caches frequently accessed data in memory and on disk, and also caches file metadata.
The main job of the PoC was to verify feasibility in a small-scale environment, focusing on product features, the operations and maintenance model, and whether JuiceFS could interface with the upstream scheduler and business frameworks.
For the PoC, we built a single-node Redis instance and a 3-node Ceph object storage cluster. Redis and Ceph are relatively mature, so there is plenty of material to refer to for deployment and O&M, and the JuiceFS client connects to Redis and Ceph in a relatively simple way. The overall deployment process was very smooth.
JuiceFS is fully compatible with POSIX, so our upper-layer business can switch over seamlessly without users noticing. JuiceFS also supports cloud native usage in the form of a CSI Driver, which fits in with our entire platform technology stack.
In terms of performance testing, we trained a text recognition model in the test environment. The model is the server version of the Chinese recognition model with a ResNet-18 backbone, the dataset totals 98G and is stored in LMDB format, and training uses 6 NVIDIA Tesla V100 GPUs. We ran three sets of experiments: reading data directly from Lustre, reading from JuiceFS with a 200G memory cache, and reading from JuiceFS with a 960G SSD cache. The results of the experiments are as follows:
The JuiceFS client's multi-level cache brings a big improvement in data reading, with more than 1x performance improvement compared to Lustre, which fits our business well.
Taking into account the operational costs, business fit and performance, we decided to run JuiceFS in production environments.
Scenarios and benefits of JuiceFS in Atlas
AI model training acceleration
JuiceFS client’s multi-level cache is currently used in text recognition, speech noise suppression and speech recognition scenarios. Since AI model training reads far more than it writes, we make full use of the client-side cache to accelerate I/O reads.
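To make the idea of a multi-level read cache concrete, here is a minimal Python sketch. It is not JuiceFS's implementation: the class names `LRUTier` and `TieredReadCache`, the block-count capacities, and the `backend_read` callback are all hypothetical and exist only for illustration. It shows a read that checks a small memory tier first, then a larger SSD tier, and only then falls back to the slow backend, demoting blocks evicted from memory to the SSD tier.

```python
from collections import OrderedDict


class LRUTier:
    """One cache tier holding at most `capacity` blocks, evicting least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        """Store a block and return the evicted (key, value) pair, or None."""
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            return self._data.popitem(last=False)
        return None


class TieredReadCache:
    """Hypothetical two-tier read cache: memory first, then SSD, then the backend."""

    def __init__(self, backend_read, mem_blocks, ssd_blocks):
        self.mem = LRUTier(mem_blocks)
        self.ssd = LRUTier(ssd_blocks)
        self.backend_read = backend_read  # assumed to return bytes, never None

    def read(self, key):
        """Return (data, source) where source names the tier that served the read."""
        value = self.mem.get(key)
        if value is not None:
            return value, "memory"
        value = self.ssd.get(key)
        source = "ssd"
        if value is None:
            value = self.backend_read(key)
            source = "backend"
        # Promote the block to memory; demote whatever memory evicts to SSD.
        evicted = self.mem.put(key, value)
        if evicted is not None:
            self.ssd.put(*evicted)
        return value, source


if __name__ == "__main__":
    cache = TieredReadCache(lambda k: k.encode(), mem_blocks=1, ssd_blocks=2)
    for key in ["a", "a", "b", "a"]:
        print(key, cache.read(key)[1])
```

In a read-heavy training workload, the same samples are read every epoch, so after the first pass most reads can be served from one of the two cache tiers instead of the backend.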
Noise Suppression Test
The data used in the noise suppression test are unmerged raw files; each sample is a small WAV file of less than 100KB. We tested the I/O performance of the data loading phase on 500 hours of data, with a 512G memory cache on the JuiceFS client node and a batch size of 40.
According to the test results, in terms of data reading efficiency alone, JuiceFS reaches 6.45 it/s on small WAV files while Lustre reaches 5.15 it/s, a 25% performance improvement. JuiceFS effectively accelerates our end-to-end model training and reduces overall model output time.
Text recognition scenarios
In the text recognition scenario, the model is CRNN and the backbone is MobileNetV2. The test environment is as follows:
In this test, we compared the speed of JuiceFS and Lustre. Reading each batch takes 1.5s from Lustre and 1.1s from JuiceFS, which is a 36% improvement in read throughput. Model convergence time decreases from 96 hours on Lustre to 86 hours on JuiceFS, so JuiceFS reduces the output time of the CRNN model by 10 hours.
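The percentages quoted in the two tests can be reproduced with a few lines of Python. The helper names below are hypothetical; the input numbers are the ones reported above. The sketch makes explicit that the 25% and 36% figures are throughput gains, and that the matching reduction in per-batch time for text recognition is smaller than 36%.

```python
def throughput_gain_percent(baseline_rate, new_rate):
    """Relative gain of new_rate over baseline_rate, in percent."""
    return (new_rate / baseline_rate - 1.0) * 100.0


def time_reduction_percent(baseline_time, new_time):
    """Share of the baseline time that was saved, in percent."""
    return (1.0 - new_time / baseline_time) * 100.0


# Noise suppression: iterations per second (Lustre 5.15, JuiceFS 6.45).
noise_gain = throughput_gain_percent(5.15, 6.45)

# Text recognition: seconds per batch, so throughput is the inverse of time.
text_gain = throughput_gain_percent(1 / 1.5, 1 / 1.1)
text_time_saved = time_reduction_percent(1.5, 1.1)

# CRNN convergence time in hours.
convergence_saved = time_reduction_percent(96, 86)

if __name__ == "__main__":
    print(f"noise suppression throughput gain: {noise_gain:.1f}%")
    print(f"text recognition throughput gain:  {text_gain:.1f}%")
    print(f"text recognition batch time saved: {text_time_saved:.1f}%")
    print(f"convergence time saved:            {convergence_saved:.1f}%")
```

Run as a script, this prints roughly 25.2%, 36.4%, 26.7%, and 10.4%, which matches the rounded figures in the text.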
Acceleration of AI model development
Before algorithm engineers submit an AI model training task to the supercomputing cluster, the model needs extensive debugging, so we provide users with a debugging environment. The Dev Node uses the same storage as the formal Atlas training cluster, and the JuiceFS client is mounted on both the development and training nodes, so changes made on the development machine can be migrated to the Atlas training cluster seamlessly.
Users can choose the development environment flexibly, either using Anaconda on the host for remote debugging or running the development environment in a container. Most users debug their models with PyTorch and TensorFlow. We found that users frequently import Python packages such as numpy and torch when debugging, and such packages consist of a large number of small files. On the old storage system, importing a package took seconds or even tens of seconds.
According to the algorithm engineers, model debugging was relatively inefficient. As a unified development environment, the Dev Node handles a large amount of package importing, code compilation, log reading and writing, and sample downloading, so it needs both high throughput and the ability to handle a large number of small files quickly.
After introducing JuiceFS, we mounted the JuiceFS client on the development machine with both the metadata cache and the data read cache enabled. In terms of metadata caching, when a JuiceFS client opens a file with the open() operation, the file attributes are automatically cached in the client's memory, and subsequent getattr() and open() operations return immediately from the memory cache as long as the cache is not invalidated. The chunk and slice information of the file is also automatically cached in client memory when a read() operation is performed. We use memory as the cache medium for data caching, so the Python packages a user imports are cached in memory after the first import and are read directly from the memory cache in the second debugging session.
Compared to the previous approach, overall speed improves by 2-4 times, which greatly improves debugging efficiency and user experience.
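The attribute-caching behavior described above can be illustrated with a small Python sketch. It is a simplified model, not JuiceFS code: the `AttrCache` class, its `ttl_seconds` parameter, and the `fetch` callback standing in for a metadata engine round trip are all assumptions made for illustration. Entries are served from memory until they expire or are explicitly invalidated.

```python
import time


class AttrCache:
    """Hypothetical in-memory attribute cache with a per-entry expiry time."""

    def __init__(self, fetch, ttl_seconds=1.0, clock=time.monotonic):
        self._fetch = fetch        # stands in for a metadata engine lookup
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the expiry logic can be tested
        self._entries = {}         # path -> (expires_at, attrs)
        self.misses = 0

    def getattr(self, path):
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and now < entry[0]:
            return entry[1]        # still valid: no metadata round trip
        self.misses += 1
        attrs = self._fetch(path)
        self._entries[path] = (now + self._ttl, attrs)
        return attrs

    def invalidate(self, path):
        """Drop a cached entry, e.g. after the file is modified."""
        self._entries.pop(path, None)


if __name__ == "__main__":
    cache = AttrCache(lambda p: {"path": p, "size": 4096}, ttl_seconds=60.0)
    for _ in range(1000):
        cache.getattr("/site-packages/numpy/__init__.py")
    print("metadata lookups:", cache.misses)
```

Importing a package touches the same small files repeatedly, so in this model a thousand getattr() calls on one path cost a single metadata lookup while the entry stays valid.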
How JuiceFS was deployed in Atlas
In terms of data storage management, we use a management method that is compatible with existing distributed storage systems. The nodes in the JuiceFS cluster are also connected to LDAP, and each node will interact with the LDAP Server cluster through the LDAP client for authentication. Each group on the HPC platform belongs to a different directory, and under each directory are the members within their respective groups or departments, and the directories between different groups are not visible. Permissions for directories are based on the Linux permission control mechanism.
When a user submits a training task on Atlas, the cluster's task submission tool automatically reads the user's UID and GID on the system and injects them into the securityContext field of the task Pod. The UIDs of all container processes running in Pods on Atlas are therefore consistent with the information on the storage system, which ensures that permissions are not crossed.
Currently, there are two ways to access data:
- One is to access data through the compute node's HostPath.
- The other is a more cloud native approach that combines JuiceFS with Fluid, which provides data access and acceleration for Atlas applications.
The first is the same way we accessed the previous distributed file storage system: directly accessing the local storage system client through a Kubernetes HostPath.
We have deployed JuiceFS clients on all CPU and GPU compute nodes, and users specify a Kubernetes HostPath volume when submitting compute tasks to map the JuiceFS directory. Cache management in this mode is rather coarse, and the cache cannot be managed on the user side.
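The identity injection step might look roughly like the following Python sketch. The Atlas submission tool itself is internal, so the function `inject_identity` and the sample manifest are hypothetical; only the `runAsUser` and `runAsGroup` fields of the Pod-level `securityContext` are standard Kubernetes. The sketch reads the submitting user's UID and GID and writes them into a copy of the Pod manifest.

```python
import copy
import os


def inject_identity(pod, uid=None, gid=None):
    """Return a copy of the Pod manifest that runs as the given UID and GID.

    When uid or gid is omitted, the identity of the calling process is used,
    which is what a submission tool run by the user would see.
    """
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    patched = copy.deepcopy(pod)  # leave the caller's manifest untouched
    context = patched.setdefault("spec", {}).setdefault("securityContext", {})
    context["runAsUser"] = uid
    context["runAsGroup"] = gid
    return patched


# Hypothetical task manifest; the name and image are placeholders.
task_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train-task"},
    "spec": {"containers": [{"name": "trainer", "image": "example.com/train:latest"}]},
}

if __name__ == "__main__":
    print(inject_identity(task_pod)["spec"]["securityContext"])
```

Because the container processes then run with the same numeric IDs that own the user's directories on the shared file system, ordinary Linux permission checks apply inside the Pod as they do on the host.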
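As an illustration of the HostPath approach, the Python sketch below builds a Pod manifest that maps a directory of the node's JuiceFS mount into the container. The function name, the volume name `jfs-data`, the host path `/mnt/jfs/team-a`, and the image are assumptions for illustration, not Atlas's actual layout; the `hostPath` volume and `volumeMounts` fields are standard Kubernetes.

```python
import json


def training_pod(name, image, host_dir, mount_dir):
    """Build a Pod manifest whose container sees host_dir at mount_dir."""
    volume_name = "jfs-data"  # hypothetical volume name
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    "volumeMounts": [{"name": volume_name, "mountPath": mount_dir}],
                }
            ],
            "volumes": [
                {
                    "name": volume_name,
                    # "Directory" requires the path to already exist on the node,
                    # i.e. the JuiceFS client must be mounted there beforehand.
                    "hostPath": {"path": host_dir, "type": "Directory"},
                }
            ],
        },
    }


if __name__ == "__main__":
    pod = training_pod("crnn-train", "example.com/train:latest", "/mnt/jfs/team-a", "/data")
    print(json.dumps(pod, indent=2))
```

The printed JSON can be saved to a file and submitted like any other manifest. The Pod only references a path on the node, so the cache behind that path belongs to the node's JuiceFS client rather than to the task, which is why users cannot manage it themselves.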
Fluid + JuiceFS
The second way is a combination of Fluid + JuiceFS. Here is a brief explanation of the architecture.
Fluid launches JuiceFS-related components, including the FUSE Pod and the Worker Pod. The FUSE Pod provides caching capabilities for the JuiceFS client, while the Worker Pod manages the cache lifecycle.
The Atlas platform's AI offline training tasks read training data by interacting with the FUSE Pod client. With the cache scheduling capabilities and dataset observability provided by Fluid, users of the platform can deploy caches on specific compute nodes through affinity scheduling. Users can also see cache usage at a glance, such as the size of the cached dataset, the cached percentage, and the cache capacity.
JuiceFS In Production Environment
We currently use Redis as the metadata engine in our production environment. The system disk of the Redis node is RAID1, and the Redis persistent data is periodically synchronized to a backup node. For Redis data persistence, we use the AOF + RDB scheme and persist data once per second, with the following configuration:
    appendonly yes
    appendfsync everysec
    aof-load-truncated yes
Since our nodes use 100G InfiniBand, the IB NIC driver (https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) must be downloaded as the ISO matching the OS version. Our nodes currently run kernel 5.4. Because the IB driver is strongly coupled to the OS and kernel versions, the driver had to be recompiled and installed when we upgraded the kernel to 5.4. Note that driver version MLNX_OFED_LINUX-5.5-220.127.116.11-rhel7.6-x86_64.iso must be compiled with GCC 9; otherwise the compilation process runs into various inexplicable problems.
    # Install GCC9
    yum --enablerepo=extras install centos-release-scl-rh
    yum install devtoolset-9-gcc
    scl enable devtoolset-9 bash
    # Compile IB driver
    mount /dev/sr0 ib
    ./mlnx_add_kernel_support.sh -m /root/ib -k (kernel version)
The object storage is a self-built Ceph cluster deployed with Cephadm, and the current production environment runs the Octopus version. Cephadm is an installation tool released with Ceph v15.2.0 (Octopus) and does not support older versions of Ceph. It does not rely on external configuration tools such as Ansible, Rook, and Salt; instead, the manager daemon connects to the hosts via SSH and can add, remove, and update Ceph containers.
When bootstrapping a single-node cluster, Cephadm deploys the mgr and mon services on the node where bootstrap is performed. When other nodes are added, a mgr management node is automatically deployed on one of them. We currently use 2 management nodes and 3 monitoring nodes in production.
For Ceph tuning, we borrowed solutions shared by other users in the community, and we thank Ctrip's engineers for their help during the tuning process. Our setup and main practices are as follows:
- 42 cores, 256GB, 24 * 18T HDD
- System Disk: 2* 960G SAS SSD
- Turn off NUMA
- Upgrade kernel to 5.4.146 and enable io_uring
- Kernel pid max, modify /proc/sys/kernel/pid_max
- Ceph RADOS: direct calls to librados interface, no S3 protocol
- Bucket shard
- Disable auto-tuning of pg
- OSD log storage (using BlueStore, recommended bare capacity ratio - block : block.db : block.wal = 100:1:1, SSD or NVMe SSD recommended for the last two)
- 3 replicas
The object storage that JuiceFS interfaces with in our environment is Ceph RADOS. JuiceFS uses librados to interact with Ceph, so the JuiceFS client needs to be recompiled, and the librados version should match the Ceph version. For example, our Ceph version is Octopus (v15.2.*), so the recommended librados version is v15.2.*. CentOS ships a lower version of librados, so we download the corresponding packages from the official website (http://fr.ceph.com/rpm-15.2.10/el7/x86_64/). In our environment we only need librados2-15.2.10-0.el7.x86_64.rpm and librados-devel-15.2.10-0.el7.x86_64.rpm.
Then run the following command to install:
yum localinstall -y librad*
After installing librados you can compile the JuiceFS client (Go 1.17+, GCC 5.4+ recommended):
After compiling JuiceFS, you can create the file system and mount the JuiceFS client on the compute nodes. At present, JuiceFS is still used in our production environment mostly through Kubernetes HostPath, so we have mounted the JuiceFS client on every GPU and CPU node and manage the mount process with systemctl to achieve automatic mounting and recovery from failure.
Finally, to summarize the features and scenarios of Lustre and JuiceFS:
- Lustre has years of experience in production environments as a storage system in the HPC area, powering many of the world's largest supercomputing systems. It conforms to the POSIX standard, supports various high-performance and low-latency networks, and allows RDMA access, so it is well suited to traditional high-performance computing. However, its adaptation to cloud native scenarios is not perfect: currently it can only be used through HostPath volumes. Its software also runs in the Linux kernel, which places high demands on O&M personnel.
- As a distributed storage system in the cloud native area, JuiceFS provides a CSI Driver and Fluid integration for better integration with Kubernetes. In terms of deployment, it gives users more flexible options: it can be deployed either on the cloud or on-premises, and storage capacity expansion and maintenance are simpler. Full compatibility with the POSIX standard allows deep learning applications to migrate seamlessly. However, because object storage has high latency for random writes, we use the client's multi-level cache for acceleration in read-only scenarios.
Users can make the appropriate choice according to their business scenarios, O&M capabilities, and storage scale.
The future plans for the Atlas platform in relation to JuiceFS:
- Metadata engine upgrade: TiKV is suitable for scenarios with more than 100 million files (up to 10 billion files) and high requirements for performance and data security. We have completed internal testing of TiKV, are actively following the progress of the community, and will migrate the metadata engine to TiKV.
- Directory (project) based quotas: the JuiceFS community edition does not currently support directory-based quotas. Each of our departments has a different directory in JuiceFS, and we need to limit usage per directory. The JuiceFS community edition already plans to implement this feature and will officially release it in a version after v1.0.0.
We thank the JuiceFS open source community for its technical support while we built efficient storage for Atlas. Unisound is also actively conducting internal testing and aims to contribute the features and improvements it develops back to the open source community.