Building an Easy-to-Operate AI Training Platform: Storage Selection and Best Practices

2023-12-14
Jichuan Sun

SmartMore, an AI unicorn, focuses on smart manufacturing and digital innovation. Our solutions enable manufacturing optimization and automation, enhancing cost efficiency, production speed, and quality control. We apply technologies such as deep learning, combined with automation equipment, to implement best practices. So far, we’ve served 200+ top enterprises like P&G and Unilever, covering 1,000+ applications.

As our business grew, data volumes rose, presenting our storage platform with new challenges: high throughput for large images, IOPS pressure from tens of millions of small files in super-resolution workloads, and complex operations and maintenance. In addition, we had only one person managing storage operations, so we looked for an easy-to-operate storage solution. We tested file systems like NFS, GlusterFS, Lustre, and CephFS, and chose JuiceFS.

This article covers:

  • The storage challenges of our industrial AI platform
  • Key takeaways for selecting storage solutions
  • Our choice of SeaweedFS for JuiceFS' underlying storage
  • How we apply JuiceFS at SmartMore
  • Insights from our storage component management experience

Our goal is to provide valuable guidance for informed storage product decisions.

Challenges in AI training platform storage

SmartMore specializes in industrial applications, covering areas like quality inspection, smart manufacturing, and process optimization. Our team handles AI platform training scenarios and provides training platform support for all business lines. We manage the company's GPU training cluster. The data we handle has these characteristics:

  • For individual projects, data volumes are small, making file systems with local caching, such as JuiceFS and BeeGFS, preferable. In industrial contexts, data volumes are small compared with applications like facial recognition or natural language processing; customers might provide just a few hundred images, which we use for iterative model development. With so few images per project, local caching delivers significant benefits.

  • Image sizes vary widely without standardized specifications. Different factories, production lines, and cameras generate images of varying sizes. Line scan camera images and BMP files are notably large, often several hundred MBs or even GBs. Conversely, videos or photos from other cameras might be just a few KB. This presents both throughput challenges for large files and IOPS challenges for massive numbers of small files.

  • Data volumes and file counts escalate rapidly, resulting in significant metadata pressure. A project may need to handle several terabytes of data in a single week. In another scenario, super-resolution datasets convert videos into images, potentially generating millions of files. This volume of files places considerable metadata pressure on file systems. Consequently, our training platform demands substantial storage capacity and long-term metadata stability during critical periods.

  • Operating within a self-built IDC without cloud services underscores the importance of storage system maintenance and scalability.

Lessons learned for choosing a storage solution: more than functionalities and performance

To tackle these challenges, we evaluated various file systems like NFS, GlusterFS, Lustre, CephFS, and JuiceFS. Ultimately, we chose JuiceFS. Balancing functionalities, performance, costs, and business communication, we derived important insights and key conclusions from this process.

Lesson 1: Striking the right balance: cost, reliability, and performance

There's a delicate balance between cost, reliability, and performance. Unless the company has substantial funds, it's usually only feasible to prioritize two of the three, as achieving all of them concurrently is challenging. When cost is a constraint, we recommend placing data safety higher in priority than performance.

Drawing from our experience in maintaining multiple clusters, we've learned that achieving acceptable performance levels suffices; an excessive pursuit of high performance might not be necessary. Our understanding of total cost of ownership (TCO) extends beyond merely considering capacity and performance prices, emphasizing lower actual holding costs through efficient resource utilization. Spending significantly on high performance that remains underutilized equates to wastage.

Lesson 2: Early user engagement to close the information gap

Onboarding internal users early is crucial because of the gap between technical engineers and application engineers. For example, after deploying our initial storage system, we received numerous user complaints. Initially, we assumed that offline tasks, whether for training or for processing large datasets, only cared about high utilization and throughput. In fact, users were highly sensitive to latency. Communicating with users early to understand their needs and feedback helps bridge this information gap.

Lesson 3: Ensuring platform consistency for task execution

Ensuring platform consistency goes beyond uniform environments; it extends to consistent task execution. For example, large image tasks that consume most of the bandwidth in production ("elephant flow") can affect completion times for small image tasks. Even on machines with similar configurations, completion times for small image tasks can double, and users ask why a task takes 6 hours on one machine but only 3 hours on another. Applying mechanisms such as quality of service (QoS) to each task helps prevent such disparities and provides a smoother user experience.
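
To make this concrete, here is a minimal sketch of one way to apply such limits; it is an illustration, not necessarily the mechanism we use in production. The JuiceFS community edition supports per-mount bandwidth caps, so a bandwidth-heavy large-image job can run against a throttled mount while latency-sensitive jobs use a separate one. The metadata address, mount paths, and limits below are placeholders:

    # Throttled mount for the bandwidth-heavy large-image job
    # (--upload-limit / --download-limit are in Mbps)
    juicefs mount -d redis://192.168.1.10:6379/1 /mnt/jfs-large-images \
        --upload-limit 2000 --download-limit 4000

    # Separate, uncapped mount for latency-sensitive small-image jobs
    juicefs mount -d redis://192.168.1.10:6379/1 /mnt/jfs-small-images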

Lesson 4: Choosing products with comprehensive tools and integration

We recommend opting for products with rich toolsets and robust integrations rather than focusing solely on documentation quantity. Our exploration of various commercial storage solutions revealed documentation inconsistencies with their respective versions. Consequently, we lean towards products with ample peripheral tools, like debugging and monitoring tools, as self-reliance often proves more reliable.

One of the reasons we chose JuiceFS is that it's rare for a file system to ship with performance debugging tools, such as juicefs stats and the access log. Furthermore, JuiceFS integrates with the continuous profiling platform Pyroscope, which lets us continuously monitor garbage collection times and resource-intensive code paths. This is practical even though we use it less frequently.
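
For reference, this is roughly how these tools are invoked (the mount point is a placeholder):

    # Live per-second metrics (FUSE operations, object storage traffic, CPU, memory)
    juicefs stats /mnt/jfs

    # Every file system call is recorded in a virtual access log under the mount point
    cat /mnt/jfs/.accesslog

    # Aggregate the access log into a real-time summary of slow operations
    juicefs profile /mnt/jfs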

Lesson 5: Effective communication with users in their language

When communicating with users, use "user language" that's easily understood rather than technical jargon. During product delivery, we demonstrated a bandwidth of 5 GB per second at an I/O size of 256 KB, fully saturating the network card. However, numbers like these mean little to algorithm users.

An article about JuiceFS enlightened us. It demonstrated completion times for training the ResNet-50 model with the ImageNet dataset on JuiceFS, alongside comparisons to other file systems. This approach simplified performance assessment and comprehension for algorithm users. After sharing this comparative data, communication with internal users became smoother, as shorter training times allowed more parameter adjustments, experimentation, and superior model creation.

Lesson 6: Adapting to real-world complexity through ongoing enhancements

For a large platform, it's difficult to simulate all usage scenarios in a testing environment. The key is to deliver and refine continuously. For instance, after the launch, we realized users' behavior didn't align with our expectations. They stored not only datasets but also installed environments like Anaconda on the storage system. These environments consist of many small files that demand ultra-low latency; without local caching, performance would be unacceptable. Such scenarios, which we didn't anticipate during testing, underline the importance of broadening the test scope early. We must involve more users in testing and real-world applications to uncover more issues.
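
As a hedged illustration of the kind of mount options involved in local caching, not our exact production configuration:

    # Mount with a local cache directory on fast disk and longer kernel metadata caching,
    # which helps read-heavy small-file workloads such as conda environments
    # (--cache-size is in MiB; the metadata cache timeouts are in seconds)
    juicefs mount -d redis://192.168.1.10:6379/1 /mnt/jfs \
        --cache-dir /data/jfsCache \
        --cache-size 204800 \
        --attr-cache 5 --entry-cache 5 --dir-entry-cache 5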

Lesson 7: Testing long-term stability and costs of all-in-memory metadata storage

Regarding all-in-memory metadata storage, if metadata becomes excessively large, it may hinder mixed deployment. The high cost of machine ownership should not be overlooked, and long-term stability under critical conditions should be tested thoroughly.

Why we chose SeaweedFS as JuiceFS' underlying storage

SeaweedFS is a vital component in our storage layer. Before we delve into JuiceFS' use cases, it's important to understand why we chose SeaweedFS as the underlying storage solution for JuiceFS. This decision was made after carefully weighing SeaweedFS' pros and cons.

In architecture selection, it's vital to avoid hasty conclusions and to run extensive scenario tests and limited-scale pilots to assess all factors. When we select a file system, the decision should take into account the broader implications for the team's future planning and development, avoiding tunnel vision.

Why we rejected CephFS and MinIO

  • Our team lacked CephFS expertise.
  • Our initial MinIO + JuiceFS setup degraded once the file count exceeded a hundred million. We started using JuiceFS with MinIO as the underlying storage in 2021, but at that scale the XFS file system beneath MinIO lost at least 30% of its performance after surpassing 100 million files.
  • A simple architecture was crucial for our small team (fewer than 10 people) managing diverse projects and tasks.

SeaweedFS’ advantages

We explored various open-source and commercial storage solutions and ultimately chose SeaweedFS as JuiceFS' underlying storage due to the following reasons:

  • SeaweedFS’ functionalities can be started with a single command (see the example after this list). It doesn't hide critical deployment and architectural details, which gives our developers the flexibility to make modifications.
  • SeaweedFS employs a Haystack-like architecture that aggregates random writes into sequential writes, which is friendly to hard disks. We adopted it in 2021, during a peak in hard drive prices driven by cryptocurrency mining; even now that HDD and SSD pricing is similar, the architecture remains beneficial.
  • The Haystack architecture supports small-file merging, which eliminates the performance degradation that comes with huge numbers of files. MinIO, by contrast, handles small files poorly: it splits each object into data and metadata components, which are segmented further during erasure coding.
  • SeaweedFS offers S3 compatibility, facilitating integration with JuiceFS.
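
To illustrate the first point above: a single command brings up a SeaweedFS instance that is already usable for testing. The data directory is a placeholder, and in production you would typically run the master, volume servers, and filer as separate processes on separate machines:

    # Start the master, volume server, filer, and S3 gateway in one process;
    # the S3 gateway listens on port 8333 by default
    weed server -dir=/data/seaweedfs -s3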

SeaweedFS’ disadvantages

SeaweedFS has the following limitations:

  • The available resources and documentation might not be as extensive as desired. Understanding certain features often requires delving into the source code directly.
  • The community support is limited, and some less common functionalities have usability issues. In our real-world use, only the core I/O path consistently performs as expected; other features do not work as well. For example:
    • The replication mechanism proved feasible, but transitioning from replication to erasure coding presented challenges, and we are still investigating this area.
    • We found challenges with SeaweedFS’ cross-datacenter synchronization functionality. Our attempts to test this feature for multi-datacenter synchronization and backups proved time-consuming and unsuccessful.
    • SeaweedFS offers data tiering functionality, which we also encountered difficulties testing successfully.

How we use JuiceFS at SmartMore

We primarily employ JuiceFS in the following scenarios:

  • Small-scale storage using Redis and MinIO
  • Large-scale storage using SeaweedFS and TiKV
  • Multi-cluster management

Scenario 1: Small-scale storage with Redis and MinIO

In cases where storage holds a few million to tens of millions of files, totaling hundreds of terabytes, we opted for Redis + MinIO (SSD) + JuiceFS. This combination proved highly effective and requires minimal day-to-day attention. We use a single-instance Redis setup, which has remained stable for over a year without significant downtime. We only need to monitor memory usage closely and keep the number of files under control.
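
A minimal sketch of how such a file system is created and mounted with the JuiceFS client; the addresses, credentials, and names below are placeholders:

    # Create a JuiceFS volume with metadata in Redis DB 1 and objects in a MinIO bucket
    juicefs format \
        --storage minio \
        --bucket http://192.168.1.6:9000/jfs \
        --access-key minioadmin \
        --secret-key minio-secret \
        redis://:mypassword@192.168.1.5:6379/1 \
        jfs-small

    # Mount it in the background on a training node
    juicefs mount -d redis://:mypassword@192.168.1.5:6379/1 /mnt/jfs-small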

Scenario 2: Large-scale storage with SeaweedFS and TiKV

For scenarios where data volumes reach the petabyte level, cost constraints led us to choose HDDs. Here we employ SeaweedFS (HDD) + TiKV + JuiceFS.
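
Because SeaweedFS exposes an S3-compatible gateway, the setup looks much like the MinIO case, only with TiKV as the metadata engine. A sketch with placeholder endpoints, credentials, and names:

    # Objects go to SeaweedFS through its S3 gateway; metadata goes to TiKV.
    # The path segment after the PD address list is the key prefix in TiKV.
    juicefs format \
        --storage s3 \
        --bucket http://seaweedfs-s3.internal:8333/jfs-large \
        --access-key seaweed-access \
        --secret-key seaweed-secret \
        tikv://pd1:2379,pd2:2379,pd3:2379/jfs-large \
        jfs-large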

Scenario 3: Multi-cluster management

In this scenario, we deploy a separate JuiceFS file system for each user, all sharing a common underlying SeaweedFS layer. This was driven by two main reasons:

  • JuiceFS v0.17 lacked directory quota functionality. This led some algorithm engineers to occupy excessive directory space.
  • Waiting for the feature or developing our own quota mechanism wasn't feasible, because our team's expertise lies primarily in cloud-native scheduling rather than storage system development.

As a solution, we embraced a multi-cluster management approach using TiKV and SeaweedFS. We assigned independent metadata prefixes to individual users, creating separate clusters. This approach offered an advantage: if a portion of TiKV experienced downtime in a scaled setup, the overall impact was mitigated. However, managing multiple file systems in this manner introduced complexities due to the sheer number of clusters.
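
Concretely, the per-user separation only requires formatting one file system per user, each with its own bucket and its own TiKV key prefix. The user name and endpoints below are hypothetical:

    # One file system per user, sharing the same TiKV PD endpoints and SeaweedFS
    # S3 gateway but using a separate bucket and metadata prefix (credentials omitted)
    juicefs format \
        --storage s3 \
        --bucket http://seaweedfs-s3.internal:8333/jfs-alice \
        tikv://pd1:2379,pd2:2379,pd3:2379/jfs-alice \
        jfs-alice

Later JuiceFS releases added directory quota functionality, which addresses the original limitation, but the per-user isolation still limits the blast radius of metadata problems.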

Insights and learnings from storage component management

In our storage component management journey, we've gained valuable insights into Redis, TiKV, MinIO, and SeaweedFS.

Comparable latency for Redis and TiKV

Redis and TiKV show nearly equivalent latency under non-extreme conditions; without careful measurement, it can be hard to tell them apart. Gaps may emerge for metadata operations that require locking, such as mkdir and rename. However, AI workloads involve far more reads than writes, and in read-dominated scenarios our monitoring shows Redis and TiKV performing similarly.

Context-driven decision-making

Choosing Redis or TiKV requires assessing your own circumstances. Redis offers simpler operations and maintenance, but it carries risks around data redundancy and high availability. For companies with stringent service-level agreement (SLA) requirements, TiKV is the more suitable choice, as Redis isn't well equipped to address such concerns.

TiKV documentation insights

TiKV is a solid open-source project, but its documentation is limited. The docs, hosted together with TiDB's on the official website, list parameters without explaining their impact on the system. Understanding these parameters requires systematic experimentation.

User-centric approach to MinIO and SeaweedFS

Practical experience has shown us that MinIO is more user-friendly and provides better data redundancy than SeaweedFS. In scenarios with fewer files, we strongly recommend MinIO. Some of our operations with lenient real-time and availability requirements combine Redis and MinIO. Only when dealing with larger data volumes do we switch to TiKV + SeaweedFS.

Conclusion

Our journey at SmartMore has led us to crucial insights into AI-domain storage solutions. The challenges we've encountered emphasize the strategic importance of selecting robust options. In this pursuit, we found JuiceFS to be a balanced choice, addressing functionalities, performance, and costs effectively.

Our experiences have highlighted the delicate balance between cost, reliability, and performance. Early user engagement, platform consistency, and clear communication emerged as key factors in maintaining efficient operations. The choice of SeaweedFS as the underlying storage for JuiceFS showcases our meticulous approach to decision-making.

Insights gained from managing various storage components such as Redis, TiKV, MinIO, and SeaweedFS have enhanced our overall understanding. This understanding equips us to tailor storage solutions to optimize operational efficiency.

In this article, we haven't introduced our architecture. If you’re curious about our architecture, you can take a look at Unisound’s case study. Generally, our architectures are quite similar. If you have any questions or would like to learn more, feel free to join JuiceFS discussions on GitHub and their community on Slack.

Author

Jichuan Sun
Jichuan Sun is the Head of Infrastructure Team (Ti Team) at SmartMore, responsible for the design and maintenance of the computing and storage infrastructure platform.
