D-Robotics, founded in 2024 and spun off from Horizon Robotics' robotics division, specializes in the research and development of foundational computing platforms for consumer-grade robots. In 2025, we released an embodied AI foundation model.
In robot data management, training, and inference, data volumes are immense. Object storage alone presented challenges such as poor small-file handling and multi-cloud data management. After trying several alternatives, including replacing our private MinIO deployment with SSD-backed storage, these difficulties remained. Ultimately, we selected JuiceFS as our core storage solution.
JuiceFS' inherent adaptability for cross-cloud operations efficiently supports data sharing needs in multi-cloud environments. In training scenarios, JuiceFS' cache mechanism, which is well suited to small-file workloads, effectively replaces traditional caching solutions while striking an effective balance between cost and performance, fully meeting our storage performance requirements. Currently, we manage tens of millions of files.
In this article, we’ll share our application characteristics, storage pain points, solution selection, implementation practices, and production tuning experiences. We hope our experience offers useful insights for those facing similar challenges in the industry.
Storage pain points in the robotics industry
The cloud platform serves as our core technical hub, undertaking key application functions such as simulation environment setup, data generation and model training, model lightweighting and deployment, and visual verification. The data types involved in the platform are diverse, mainly including sensor image data, LiDAR point cloud data, model weights and configuration data, motor operational data, and map construction data.
While object storage meets basic storage needs for massive data, its performance limitations become particularly obvious when handling the massive small files frequently encountered in robotics applications. Our storage system faced four challenges:
- Metadata performance bottleneck with massive small files: Robot model training involves tens of millions to billions of sensor images, LiDAR data, and model files. Traditional object storage (like standard S3) exhibits significant metadata operation bottlenecks at this scale. The fixed API latency for routine operations like listing files or retrieving attributes is typically 10–30 ms. This directly constrains queries per second (QPS) performance during training and inference and impacts overall R&D efficiency.
- Inefficient multi-cloud collaboration and data flow: As robotics companies increasingly adopt multi-cloud architectures for their R&D and production applications, ensuring efficient data synchronization and sharing across different cloud platforms and geographical regions has become a common challenge for the industry. Traditional storage solutions typically suffer from low cross-cloud data transfer efficiency and are often deeply integrated with a single cloud provider. This leads to technical lock-in and makes it difficult to achieve flexible cross-cloud deployment and data collaboration.
- The impossible trinity of performance, cost, and operations: High-performance parallel file systems offer high throughput and low latency but typically rely on all-flash arrays or dedicated hardware, leading to high hardware investment, ongoing operational costs, and complex deployment. Low-cost object storage offers good elasticity but struggles to support the high-throughput I/O demands of GPU clusters in AI training scenarios. A common industry workaround is using a high-speed file system as a cache synchronized with S3. However, the extra data synchronization steps significantly reduce usability and fail to achieve efficient storage-compute synergy.
- Difficulty in dataset version management: The rapid iteration cycle of robot models requires efficient and granular management of multiple dataset versions. Using physical copies for version control directly multiplies underlying storage consumption, significantly increasing costs. Moreover, retrieving, reusing, and maintaining multi-version data becomes substantially harder.
Storage selection: JuiceFS vs. MinIO/S3 vs. PFS
To address these storage challenges, we established a clear evaluation framework for storage selection. A comprehensive comparative test was conducted on mainstream storage solutions across six core dimensions: storage architecture, protocol compatibility, metadata performance, scalability, multi-cloud adaptability, and cost efficiency (which also reflects operational complexity).
| Comparison basis | JuiceFS | MinIO / Public Cloud S3 | CephFS / Public Cloud FS (CPFS) |
|---|---|---|---|
| Storage architecture | Separation of metadata and data | Unified object storage | Metadata and data typically coupled, often with kernel-space parallel design |
| Protocol support | Full compatibility: POSIX, HDFS, S3 API, Kubernetes CSI | Primarily S3 API, with weak POSIX compatibility | POSIX-oriented; HDFS or S3 compatibility often requires plugins |
| Metadata performance | Very high: sub-millisecond latency, supports hundreds of billions of files per volume | Lower: high metadata overhead for massive small files; API call overhead about 10–30 ms | Medium to high: performance bottlenecks and complexity challenges at ultra-large scale (100M+ files) |
| Scalability | High: horizontal scaling, supports tens to hundreds of billions of files per volume | High: near-infinite storage capacity, but small-file management efficiency degrades with scale | Moderate: scaling limited by metadata nodes; operational complexity grows exponentially with scale |
| Multi-cloud adaptability | Native support | Relies on sync tools; cross-cloud data flow inefficient; global unified view difficult | Limited: often tightly bound to specific hardware or cloud provider; cross-cloud deployment is complex |
| Cost efficiency | High: strong performance-to-cost ratio | Low overall: raw storage is cheap, but low GPU utilization in high-throughput scenarios like AI training drives up effective cost | Low: often requires all-flash architecture or dedicated hardware; high operational labor cost |
Based on the comparison results above, JuiceFS demonstrates significant advantages in core performance, scalability, multi-cloud adaptability, and cost efficiency. This makes it the preferred choice for our unified storage solution.
Furthermore, JuiceFS has been widely adopted in the autonomous driving industry. Leading companies such as Horizon Robotics have leveraged JuiceFS to manage data at the exabyte scale. This demonstrates its maturity and effectiveness in large-scale production environments.
For our specific application scenarios, JuiceFS offers the following core technical advantages:
- Decoupled architecture: JuiceFS adopts a metadata-data separation architecture, persisting data in cost-effective object storage (like S3 or OSS) while storing metadata in databases like Redis or TiKV. This decoupled design enables elastic storage scaling and reduces dependence on any single cloud provider.
- Chunking and caching mechanisms: JuiceFS uses chunks, slices, and blocks to significantly improve small file read efficiency and enhance concurrent read/write performance. In addition, multi-level caching (memory, local SSD, distributed cache) reduces access latency for hot data. This meets the demands of high-throughput training workloads.
- Cloud-native adaptability: By providing a CSI Driver, JuiceFS delivers persistent storage decoupled from compute nodes in Kubernetes environments, supporting stateless container deployment and cross-cloud migration. It enables data sharing, enhances application high availability and flexibility, and adapts to various Kubernetes deployment methods.
- Full-stack support for AI training: JuiceFS fully supports POSIX, HDFS, and S3 API, and is compatible with mainstream AI frameworks such as PyTorch and TensorFlow. It can be integrated without code modifications, lowering the technical barrier for adoption.
- Multi-cloud support: Its cross-cloud capabilities and high-performance metadata engine ensure efficient data flow, perfectly aligning with our strategy of "computing power on demand."
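The chunk/slice/block layout mentioned above can be illustrated with a small sketch. The sizes below are JuiceFS defaults (64 MiB logical chunks split into 4 MiB blocks; the block size is configurable at format time), and the helper function is purely illustrative, ignoring the slice bookkeeping that tracks overlapping writes:

```python
CHUNK_SIZE = 64 << 20   # 64 MiB logical chunks (JuiceFS default)
BLOCK_SIZE = 4 << 20    # 4 MiB physical blocks, each stored as one object

def locate(offset: int):
    """Map a file offset to (chunk index, block index within chunk, offset in block)."""
    chunk_idx, chunk_off = divmod(offset, CHUNK_SIZE)
    block_idx, block_off = divmod(chunk_off, BLOCK_SIZE)
    return chunk_idx, block_idx, block_off

# Byte 70 MiB of a file lands 2 MiB into block 1 of chunk 1.
print(locate(70 << 20))  # -> (1, 1, 2097152)
```

Because each block is an independent object, concurrent readers can fetch different blocks of the same file in parallel, and a small file occupies only as many blocks as it needs.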
From a cost perspective, JuiceFS does not offer a significant cost advantage in the early stages of small-scale deployment. However, when data volume reaches the petabyte level—especially at the 10 PB or 100 PB scale—and is compared against all-flash storage solutions, its cost-efficient architecture built on object storage becomes fully evident. In addition, JuiceFS requires minimal operational overhead. Currently, we need only one engineer to manage the entire cloud platform and storage system, a fraction of the personnel required by traditional solutions.
From Community Edition to Enterprise Edition: addressing larger-scale scenarios
As our application continued to expand, we encountered limitations when using Redis as the metadata engine—specifically, physical memory capacity constrained data scalability. When the number of files approached the hundred-million level, metadata query latency increased significantly. This impacted the concurrency efficiency of training tasks. After using the clone feature, the metadata volume grew substantially. In addition, in cross-cloud scenarios, we faced higher demands for metadata synchronization and mirror file system capabilities. We also required more granular capacity controls and permission management at the directory level.
Considering these requirements—along with our desire to leverage local SSDs on GPU nodes to build a distributed cache layer for improved performance—we decided to deploy JuiceFS Enterprise Edition in parallel, migrating core scenarios such as ultra-large-scale directory management and multi-node collaborative training to this version. Through this scenario-based approach, we’ve effectively enhanced the adaptability of our overall storage system and established a solid foundation for future application growth. Below are the key features of the Enterprise Edition that we’ve applied in real-world scenarios.
High-performance metadata engine: solving the bottleneck of large-scale directory retrieval
For high-frequency operations such as traversing directories with hundreds of millions of files and deep pagination queries, we previously encountered the "slower as you query" problem with traditional storage solutions. When the number of files in a single directory exceeded 10 million, and the pagination offset surpassed 100,000 entries, response latency would spike from hundreds of milliseconds to several seconds. This severely impacted data filtering efficiency.
After switching to JuiceFS Enterprise Edition, its native tree-structured metadata storage architecture played a key role. Unlike flat key-value metadata layouts, which scatter file metadata without directory locality, the tree structure allows direct navigation to the target directory level, narrowing the scope of each metadata scan. In our actual tests, deep pagination queries (with an offset of 500,000 entries) in a directory containing 120 million files saw latency drop from 3.8 seconds to just 210 milliseconds. This fully met the retrieval needs of large-scale datasets. In addition, this engine supports storing hundreds of billions of files per volume, and we’ve already used it to manage three petabyte-scale training datasets stably, aligning with our application growth expectations.
Enterprise-grade distributed cache: improving data sharing efficiency in multi-node, multi-GPU training
In multi-node, multi-GPU training scenarios, we previously faced challenges such as low cache hit rates and cross-node bandwidth congestion. The open-source version only supports local caching on each node. This means that when multiple nodes pull the same dataset simultaneously, each node must access object storage independently. This resulted in single-node bandwidth utilization exceeding 90%, with average training job startup delays of up to 20 minutes.
With JuiceFS Enterprise Edition's distributed caching feature, we set up a distributed cache across a 12-node training cluster using just three commands. The dataset only needs to be pulled from object storage once and is cached in a pool built from local SSDs across the nodes. As a result, the cache hit rate for multi-node collaborative training increased from 45% to 92%, cross-node bandwidth utilization dropped to below 15%, and training job startup time was reduced to under three minutes. This significantly improved compute utilization.
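As a rough sketch, the per-node setup looks like the following. The volume name, cache path, and cache size are placeholders; `--cache-group` is the Enterprise Edition mount option that joins the nodes' local SSD caches into one shared pool:

```shell
# Run on each training node; all mounts that share the same
# --cache-group name pool their local SSD caches.
juicefs mount myjfs /mnt/jfs \
    --cache-group training-cluster \
    --cache-dir /ssd/jfs-cache \
    --cache-size 512000   # MiB; sized to the node's spare SSD capacity
```

Once mounted this way, a block fetched by any node is served to the others from the cache pool instead of from object storage.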
Enhanced cross-cloud collaboration: building a low-operational-cost cross-cloud data foundation
Since our R&D environments are distributed across two cloud environments, we previously encountered challenges with slow cross-cloud data synchronization and high operational costs. Using traditional synchronization tools to maintain data consistency between the two clouds required configuring eight scheduled tasks, with an average synchronization delay of four hours, and dedicated personnel needed to investigate sync failures weekly.
By using the JuiceFS sync tool combined with our internal AI operations tools, we achieved automated configuration of synchronization policies. The system automatically adjusts sync priorities based on data heat levels, keeping cross-cloud data latency within 10 minutes. In addition, tasks such as failure retries and log alerts for synchronization are fully automated, eliminating the need for dedicated monitoring. This has reduced operational overhead by 70%, and we now stably support multiple training projects across two cloud platforms sharing the same dataset. Going forward, we plan to use the Enterprise Edition's mirror file system feature to further enhance cross-cloud data collaboration.
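As an illustration, a one-shot cross-cloud copy with the sync tool looks like this. The bucket names and endpoints are placeholders; in practice our internal AI operations tooling wraps commands like this with heat-based priority scheduling, retries, and alerting:

```shell
# Incremental sync from AWS S3 to Alibaba Cloud OSS; --update skips
# objects that have not changed, --threads raises transfer concurrency.
juicefs sync --update --threads 32 \
    s3://robot-datasets.s3.cn-north-1.amazonaws.com.cn/train/ \
    oss://robot-datasets.oss-cn-beijing.aliyuncs.com/train/
```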
JuiceFS optimization
Client cache and write performance tuning
Pay close attention to compatibility issues between caching strategies and Kubernetes resource limits. For example, using memory as the local cache path with improper configuration may lead to abnormal memory growth in the Mount Pod, while insufficient resource quota reservations may cause checkpoint loss or file-handle write exceptions during long-running training tasks.
Regarding write performance tuning, enabling writeback mode can improve small file write throughput to some extent. However, considering production environment requirements for data consistency, we still adopt write-through synchronous mode to reduce data risks in extreme crash scenarios. It’s recommended to cautiously enable writeback mode only in scenarios with lower data reliability requirements, such as temporary computing or offline data cleaning, based on actual needs.
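For reference, writeback is a mount-time switch. A hedged sketch follows; the metadata URLs and mount points are placeholders:

```shell
# Production volumes: default write-through mount, no --writeback.
juicefs mount redis://meta-host:6379/1 /mnt/jfs

# Scratch volume for temporary computing or offline cleaning only:
# --writeback acknowledges writes once they hit the local cache disk,
# so a node crash can lose data not yet uploaded to object storage.
juicefs mount redis://meta-host:6379/2 /mnt/jfs-scratch --writeback
```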
Deployment and network topology optimization
For more stable performance, it’s strongly recommended to deploy the metadata engine and compute nodes within the same region during deployment. In actual operations, we observed that cross-region deployment could increase metadata operation latency by several to ten times. This significantly impacted I/O-intensive operations such as data decompression. Deploying metadata services and GPU computing resources within the same region helps maintain performance while controlling network transmission costs, improving overall resource utilization efficiency.
Data warm-up and cache optimization
In a 10-gigabit network environment, fully utilizing JuiceFS' data warm-up and reasonably adjusting data block sizes for the application scenario can better exploit the available network bandwidth and improve read throughput. Combined with the distributed cache architecture, this effectively enhances data sharing efficiency in multi-node concurrent scenarios and improves cache hit rates during high-concurrency reads, thereby optimizing the overall performance of large-scale AI training tasks.
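A minimal warm-up sketch (the dataset path is a placeholder; `--threads` defaults to 50 and can be raised on fast links):

```shell
# Pre-populate the cache before the training job starts, so first-epoch
# reads hit local SSD instead of object storage.
juicefs warmup --threads 50 /mnt/jfs/datasets/lidar-v3
```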
Resource quotas and high availability guarantee
In enterprise-level multi-role operations and storage responsibility separation scenarios, to avoid operational risks caused by inconsistent configurations, it’s recommended to finely control resource quotas for JuiceFS CSI Driver in Kubernetes environments. By appropriately setting CPU and memory request/limit for Mount Pods, Pod restarts or node anomalies caused by resource preemption can be reduced. In practice, resource reservation ratios can be dynamically adjusted based on cluster load.
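A sketch of such a configuration, using the Mount Pod resource parameters exposed by the JuiceFS CSI Driver in a StorageClass. The secret names and resource values are illustrative and should track actual cluster load:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: juicefs-sc
provisioner: csi.juicefs.com
parameters:
  csi.storage.k8s.io/provisioner-secret-name: "juicefs-secret"
  csi.storage.k8s.io/provisioner-secret-namespace: "kube-system"
  csi.storage.k8s.io/node-publish-secret-name: "juicefs-secret"
  csi.storage.k8s.io/node-publish-secret-namespace: "kube-system"
  # Requests reserve capacity for the Mount Pod; limits cap preemption.
  juicefs/mount-cpu-request: "1"
  juicefs/mount-cpu-limit: "4"
  juicefs/mount-memory-request: "1Gi"
  juicefs/mount-memory-limit: "8Gi"
```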
In addition, for scenarios with high application continuity requirements, the automatic mount point recovery feature for Mount Pods can be enabled to achieve automated fault recovery for storage services, further ensuring underlying storage stability.
Multi-tenancy
We provide independent file systems and storage buckets for large enterprise customers, while isolating small and medium-sized enterprises and end users through subdirectory isolation and permission control.
Large enterprises can flexibly scale throughput and capacity, avoiding performance bottlenecks associated with shared storage buckets. For small and medium-sized enterprises and end users, we ensure data security and independence through subdirectory isolation and permission control, while enabling accurate metering and billing.
This architecture ensures tenant isolation while flexibly allocating resources, improving system management efficiency.
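A hedged sketch of the subdirectory approach, using Community Edition 1.1+ syntax; the metadata URL, tenant paths, and quota figures are placeholders:

```shell
# Cap a tenant's subtree at 10 TiB and 10M inodes for metering.
juicefs quota set redis://meta-host:6379/1 --path /tenants/tenant-a \
    --capacity 10240 --inodes 10000000

# Expose only that tenant's subtree at mount time.
juicefs mount redis://meta-host:6379/1 /mnt/tenant-a --subdir /tenants/tenant-a
```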
Version management
Using the juicefs clone command, copies of original datasets can be quickly created and modified independently without affecting the source data. The clone operation copies only file metadata; the underlying object data is shared with the source, and only subsequent changes consume additional storage. This feature supports managing multiple dataset versions, facilitating rollback and recovery and ensuring data security and version control.
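A minimal example (paths are placeholders; source and destination must live in the same mounted file system):

```shell
# Near-instant, metadata-only copy of a dataset for a new experiment.
juicefs clone /mnt/jfs/datasets/v1 /mnt/jfs/datasets/v2-experiment
```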
Summary
JuiceFS' characteristics in metadata performance, scalability, cross-cloud adaptability, and comprehensive cost efficiency have made it our choice for building a unified storage layer. Currently, we adopt both JuiceFS Community Edition and Enterprise Edition to accommodate different storage requirements across various application scenarios.
In the future, we plan to further implement JuiceFS in the embodied intelligence field, addressing specific storage needs in this scenario. These include high-throughput processing of time-series data, precise multi-modal data alignment, edge-cloud collaborative storage, and integrated management of simulation and real-world data.
If you have any questions about this article, feel free to join the JuiceFS discussions on GitHub and the community on Slack.