Improving Read Performance by ~30% in AI Speech and Text Processing with a Distributed Storage System

2022-09-06
Dongdong Lv

Note: This post was published on DZone.

Unisound is an AI unicorn startup specializing in speech recognition and artificial intelligence (AI) services. Our high-performance computing (HPC) platform, Atlas, supports the company with powerful computing capabilities across various AI domains. Its deep learning computing power exceeds 57 PFLOPS (5.7 × 10¹⁶ floating-point operations per second).

However, finding a suitable storage system was challenging for us. After trying CephFS, Lustre, and Alluxio+Fluid, we found that JuiceFS, an open-source cloud-native distributed file system, outperformed Lustre in system performance and was easy to maintain.

In our test, JuiceFS achieved read performance improvements of 36% and 25% in text recognition and noise suppression, respectively. Therefore, we adopted it in our production environment and applied its multi-level cache in text recognition, speech noise suppression, and speech recognition.

In this post, I'll share our storage challenges, the solution we chose, the benefits we've experienced, the lessons we learned from comparing Lustre and JuiceFS, and our future plans. I hope this post helps you choose a suitable storage system based on your application needs and storage scale.

Our storage challenge

The Atlas architecture

As an HPC platform, Atlas requires powerful computing capabilities and an efficient storage system. Our storage system must meet the following requirements:

  • Support for multiple types of structured and unstructured data storage
  • Compatibility with POSIX
  • High performance in scenarios with massive numbers of small files

Previously, we tried CephFS, Lustre, and Alluxio+Fluid, but they were not ideal:

  • When there were tens of millions of small files, CephFS had serious performance problems. When users accessed files, the system became unresponsive or even froze, especially in high I/O scenarios.
  • Lustre hit performance bottlenecks under highly concurrent requests and could not meet the needs of scenarios with high bandwidth and IOPS requirements. Troubleshooting was also difficult, because Lustre runs in the Linux kernel; diagnosing and fixing most problems required restarting the operating system. In addition, Lustre did not support file-level redundancy (FLR), so it could only tolerate hard drive failures through hardware RAID.
  • Alluxio+Fluid sped up our AI model training with its distributed cache for I/O acceleration, but it reduced the total bandwidth of the storage system.

Therefore, we looked for a storage system that met the following requirements:

  • Easy operations and maintenance. It should be easy to use, scale out, and troubleshoot.
  • Reliable data. Data is a valuable asset for AI companies, and the data that algorithm engineers use must be stable and safe.
  • Multi-level client caching. To build a cost-effective large-scale data storage system (of petabyte-scale or above), we store the majority of the data on HDDs. To automatically distinguish between hot and cold data and fully utilize the near-terabyte memory and independent SSDs of our GPU servers, we require multi-level automatic caching capabilities within the client to handle intensive I/O read and write scenarios.
  • An active community, which leads to fast responses in terms of feature updates and bug fixes.

A distributed file system as the solution

JuiceFS is an open-source, high-performance, distributed file system designed for the cloud, compatible with POSIX, HDFS, and the S3 API. Our proof of concept (PoC) test demonstrated that JuiceFS outperformed Lustre. Additionally, it is easy to operate and maintain, making it a great fit for our application. Therefore, we deployed JuiceFS in our production environment.

The JuiceFS architecture

JuiceFS consists of:

  • A metadata engine
  • An object storage service
  • A JuiceFS client

The JuiceFS architecture

JuiceFS separates data and metadata storage. Files are split into chunks and stored in object storage like Amazon S3, while the corresponding metadata can be stored in various databases such as Redis, MySQL, TiKV, and SQLite, depending on the scenario and requirements.
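To make this layout concrete, here is a toy Python sketch (not JuiceFS's actual code) of how a byte offset in a file maps to a fixed-size chunk. The 64 MiB chunk size follows JuiceFS's documented format; the helper name is ours.

```python
# Toy illustration of JuiceFS-style data layout: a file is logically split
# into fixed-size chunks whose contents live in object storage, while the
# chunk-to-object mapping lives in the metadata engine.
CHUNK_SIZE = 64 << 20  # 64 MiB, the chunk size JuiceFS documents

def locate(offset: int) -> tuple[int, int]:
    """Map a byte offset within a file to (chunk index, offset in chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# Byte 100,000,000 of a file falls 32,891,136 bytes into chunk 1.
print(locate(100_000_000))  # (1, 32891136)
```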

JuiceFS advantages

  • Many proven solutions for metadata engine and object storage.
  • Fully managed services available out-of-the-box on public cloud.
  • Automatic metadata backups, a recycle bin, and other features ensure data reliability and help prevent data loss.
  • Local caching automatically caches frequently accessed data in memory and on disks, and also caches file metadata.

JuiceFS vs. Lustre in our PoC test

We conducted a PoC to test JuiceFS' product features, its operations and maintenance, and the feasibility of integrating it with our upstream scheduling and application frameworks.

The PoC environment consisted of a single-node Redis instance plus a three-node Ceph object storage cluster. Connecting the JuiceFS client to Redis and Ceph was easy.

We found that JuiceFS was suitable for our application:

  • It was fully compatible with POSIX, so our upper-layer application could switch to it seamlessly; the change was transparent to users (a minimal sketch follows this list).
  • JuiceFS supported a CSI Driver, which fit our technology stack.
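Because JuiceFS is mounted as a regular POSIX file system, application code needs no storage-specific client library. A minimal sketch of that transparency, assuming a hypothetical mount point at `/mnt/jfs`:

```python
# Standard POSIX file I/O works unchanged against a JuiceFS mount point.
# "/mnt/jfs" is a hypothetical mount path used for illustration.
from pathlib import Path

mount = Path("/mnt/jfs")

# Write exactly as we would to a local disk.
(mount / "experiments").mkdir(parents=True, exist_ok=True)
(mount / "experiments" / "notes.txt").write_text("epoch 3, loss 0.41\n")

# Read it back; the application cannot tell it is talking to JuiceFS.
print((mount / "experiments" / "notes.txt").read_text())
```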

To test JuiceFS performance, we trained the text recognition model in the test environment with:

  • The server version of the Chinese recognition model
  • The ResNet-18 backbone
  • 98 GB of data stored in LMDB format

We conducted the following three tests on six NVIDIA Tesla V100 GPUs and got the results shown in the figure below:

  • Reads in Lustre (“Lustre” in the figure below)
  • Reads in JuiceFS with 200 GB memory cache (“jfs-mem” in the figure below)
  • Reads in JuiceFS with 960 GB SSD cache (“jfs-ssd” in the figure below)
Comparison of time consumption

The tests showed that JuiceFS' data read performance was over twice that of Lustre due to its client's multi-level caching capability.

How we benefit from JuiceFS

We apply the JuiceFS client’s multi-level cache in text recognition, speech noise suppression, and speech recognition scenarios. AI model training usually involves frequent reads and infrequent writes. Therefore, we fully leverage the JuiceFS client cache to accelerate I/O reads.

Benefit #1: accelerating AI model training

In the noise suppression scenario

The data used in the noise suppression test were unmerged raw files; each was a small audio file of less than 100 KB in WAVE format. We focused on testing the data loading I/O performance of JuiceFS client nodes, each with a 512 GB memory cache. We ran the tests with a batch size of 40 on a 500-hour dataset.

The following figure shows the results:

Lustre vs. JuiceFS in data processing performance

According to the test results, JuiceFS delivered a 25% improvement in read performance for small WAV files, reaching 6.45 it/s versus Lustre's 5.15 it/s. This acceleration carried through to end-to-end model training and significantly reduced our model output time.
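To show why small-file read performance dominates this workload, here is a minimal PyTorch-style dataset sketch: every `__getitem__` opens and reads one sub-100 KB WAV file, so each training step issues dozens of small reads that the client cache can absorb. The directory path, 16-bit PCM assumption, and batch handling are ours, not the production pipeline.

```python
import wave
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class WavDataset(Dataset):
    """Each item is one small WAV file read from the shared mount."""

    def __init__(self, root: str):
        self.paths = sorted(Path(root).glob("**/*.wav"))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> torch.Tensor:
        # One small read per sample: the I/O pattern that benefits
        # most from JuiceFS's client-side memory cache.
        with wave.open(str(self.paths[idx]), "rb") as f:
            frames = f.readframes(f.getnframes())
        # Assumes 16-bit PCM audio; normalize to [-1, 1].
        samples = np.frombuffer(frames, dtype=np.int16).astype(np.float32)
        return torch.from_numpy(samples / 32768.0)


def collate(batch):
    # Clips have different lengths, so return them as a list.
    return batch


# batch_size=40 matches the test setup; the path is hypothetical.
loader = DataLoader(WavDataset("/mnt/jfs/denoise/train"),
                    batch_size=40, num_workers=8, collate_fn=collate)
```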

In the text recognition scenario

The test environment for the text recognition scenario was as follows:

  • Model: CRNN with a MobileNetV2 backbone
  • Deep learning framework: PyTorch
  • Data size: 50 GB of LMDB data generated from 3 × 32 × 320 images
  • Data type: LMDB
  • GPU server: 6 × NVIDIA Tesla V100, 512 GB memory, 56 CPU cores
  • Cache: 200 GB
  • Batch size: 6 × 400

This figure shows the test results:

Lustre vs. JuiceFS in data reading speed and model training time

According to the test results, JuiceFS read each batch in 1.1 s versus Lustre's 1.5 s, a 36% performance improvement. Moreover, JuiceFS reduced the model's convergence time from 96 hours to 86 hours, cutting CRNN model output time by 10 hours.
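For reference, reading training samples from a single large LMDB file through the mount point might look like the sketch below. The key scheme and sample encoding are assumptions for illustration; the actual pipeline differs.

```python
import lmdb
import torch
from torch.utils.data import Dataset


class LmdbImageDataset(Dataset):
    """Reads pre-serialized samples from one large LMDB file.

    LMDB packs millions of small samples into a single memory-mapped
    file, so the file system sees large reads instead of tiny ones.
    """

    def __init__(self, lmdb_path: str):
        # readonly + lock=False is the usual setting for training readers.
        self.env = lmdb.open(lmdb_path, readonly=True, lock=False,
                             readahead=False, meminit=False)
        with self.env.begin() as txn:
            self.length = txn.stat()["entries"]

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, idx: int) -> torch.Tensor:
        with self.env.begin() as txn:
            raw = txn.get(f"{idx:08d}".encode())  # hypothetical key scheme
        # Assume each value is a flattened 3 x 32 x 320 float32 image.
        data = torch.frombuffer(bytearray(raw), dtype=torch.float32)
        return data.view(3, 32, 320)
```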

Benefit #2: accelerating AI model development

Model development at Atlas

The debugging environment

Before submitting an AI model training task to an HPC cluster, an algorithm engineer must extensively debug the model. To aid this process, we provide a debugging environment for users. The Dev Node shares the same storage as the formal Atlas training cluster and has a JuiceFS client mounted, so changes made on the development machine migrate seamlessly to the Atlas training cluster.

Users can choose their development environment, either by using Anaconda on the host for remote debugging or by running the development environment in a container.

Debugging platforms

PyTorch and TensorFlow are the most commonly used platforms for debugging. However, frequently importing Python packages such as numpy and torch, which consist of a large number of small files, can cause significant delays. Under the old storage system, this process could take several seconds or even tens of seconds.
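The delay is easy to observe: timing a cold import against a warm one shows how much of the cost is small-file I/O rather than computation. A rough measurement sketch (the package choice is arbitrary):

```python
import importlib
import sys
import time


def timed_import(name: str) -> float:
    """Return the wall-clock seconds spent importing a module."""
    start = time.perf_counter()
    importlib.import_module(name)
    return time.perf_counter() - start


# The first import walks hundreds of small .py/.so files on the
# shared mount; with a cold storage cache this dominated startup.
print(f"cold import: {timed_import('numpy'):.2f}s")

# Evict the package from Python and import again: the files are now
# in warm caches, so the same import finishes far faster.
for mod in [m for m in sys.modules if m.startswith("numpy")]:
    del sys.modules[mod]
print(f"warm import: {timed_import('numpy'):.2f}s")
```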

Challenges

Algorithm engineers reported that model debugging efficiency was low. To address this, the Dev Node needed high throughput and the ability to quickly handle large numbers of small files for operations such as:

  • Installation package imports
  • Code compilation
  • Log reading and writing
  • Sample downloading

JuiceFS as the solution

To solve these problems, we introduced JuiceFS, which offers the following benefits:

  • Metadata caching: when the JuiceFS client opens a file via `open()`, its file attributes are automatically cached in the client's memory. Subsequent `getattr()` and `open()` operations return immediately from the memory cache, as long as the cache has not been invalidated. The file's chunk and slice information is also cached in client memory when `read()` is performed.
  • Data caching: we use memory as the cache medium, so all Python packages a user touches during debugging are cached in memory after the first import. On subsequent debugging runs, those files are read directly from the memory cache (see the sketch after this list).
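Here is a small sketch of how we'd verify this behavior: repeat a `stat()` and a full read of the same file, and compare cold and warm timings. The file path is hypothetical.

```python
import os
import time

PATH = "/mnt/jfs/datasets/train/sample_000001.wav"  # hypothetical file


def timed_us(fn, *args) -> float:
    """Return the wall-clock microseconds spent running fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return (time.perf_counter() - start) * 1e6


def read_once(path: str) -> None:
    with open(path, "rb") as f:
        f.read()


# First access hits the metadata engine and object storage.
print(f"cold stat: {timed_us(os.stat, PATH):.0f} us")
print(f"cold read: {timed_us(read_once, PATH):.0f} us")

# Repeated access should be served from the client's in-memory
# metadata and data caches, until the cache is invalidated.
print(f"warm stat: {timed_us(os.stat, PATH):.0f} us")
print(f"warm read: {timed_us(read_once, PATH):.0f} us")
```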

JuiceFS helped us achieve an overall speedup of 2-4 times. This significantly improved debugging efficiency and user experience.

Lessons learned: Lustre vs. JuiceFS

We made a comparison between Lustre and JuiceFS and summarized their advantages and disadvantages for your reference. You can make your own choice based on your business needs, operation and maintenance capability, and storage scale.

Lustre

Advantages:

  • It has years of production experience as an HPC storage system, powering many of the world's largest supercomputing systems.
  • It is compliant with the POSIX standard and supports various high-performance, low-latency networks.
  • It allows RDMA access.

Disadvantages:

  • It is not well adapted to cloud-native scenarios and can only be integrated via HostPath volumes.
  • It runs in the Linux kernel, which places high demands on operations and maintenance staff.

JuiceFS

Advantages:

  • As a distributed storage system in the cloud-native area, JuiceFS provides CSI Driver and Fluid to enable better integration with Kubernetes.
  • It provides users with more flexible options. Users can choose either on the cloud or on-premises deployment. It’s easy to maintain and scale storage capacity.
  • It is fully compatible with POSIX, allowing seamless migration of deep learning applications.

Disadvantage:

  • It may have high latency on random writes due to the characteristics of object storage. In our mostly read-only scenarios, we sidestep this issue and use the client's multi-level cache to accelerate reads.

What’s next

Our future plans for JuiceFS at the Atlas platform:

  • Metadata engine upgrade: TiKV is suitable for scenarios with more than 100 million files (up to 10 billion files) and high requirements for performance and data security. We have completed our tests on TiKV and will migrate the metadata engine to TiKV.
  • Directory (project) based quotas: The JuiceFS Community Edition does not support directory-based quotas. Our departments work in different directories of JuiceFS, and we need to enforce quotas on those directories. The community plans to implement directory-based quotas in a release after v1.0.0.

If you have any questions or would like to learn more, feel free to join our discussions on GitHub and community on Slack.
