Background & Challenges
The Hadoop ecosystem is widely used for big data, but HDFS requires long-term maintenance, especially as data keeps growing, and this remains a persistent challenge for the community. JuiceFS addresses these problems with a design built for the cloud. It is a fully managed service that can handle tens of billions of files in a single file system, which makes it an ideal data storage option for big data on public clouds.
When companies that use Hadoop migrate from their own data centers (IDC) to the public cloud, the first challenge is how to migrate the data stored in HDFS. Public clouds typically do not offer a fully managed HDFS service, so customers still have to run it themselves. In addition, although HDFS is a common choice for self-built data storage systems, it does not fit well with the storage products available in the public cloud and cannot take advantage of the cloud's elasticity, which makes it less effective.
Using object storage directly for big data platforms also has many problems. Object storage is not a file system, and it lacks essential features, such as strong consistency and atomic rename, that computing components like Hadoop and Spark rely on heavily.
As a file system built on object storage, JuiceFS keeps the elasticity, zero maintenance, and low cost of object storage, and adds a strongly consistent, high-performance, highly available metadata service to ensure that large-scale data analysis tasks run correctly, stably, and efficiently.
In short, JuiceFS makes work at petabyte scale faster, cheaper, and easier.
JuiceFS is fully compatible with Hadoop / Spark / Hive / HBase / Presto / Impala. You can use it alongside HDFS to store cold data (which usually takes up most of the space), or replace HDFS with JuiceFS completely. Replacing HDFS separates compute from storage and takes full advantage of the flexibility and scalability of the public cloud.
Expect the following benefits when using JuiceFS for big data:
Lower Storage Cost
With HDFS, you need to purchase machines, and those machines take man-hours to maintain. In contrast, JuiceFS is a fully managed storage solution, so you only pay for the storage service. Because the service is elastic, there is no need to plan ahead and allocate storage space. Since the HDFS replication factor defaults to 3, JuiceFS can reduce storage costs by up to 70% for the same amount of data.
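The effect of the replication factor on capacity can be sketched with a little arithmetic. This is only an illustration: the function name and the 100 TB figure are hypothetical, and the sketch compares raw capacity rather than real prices, which vary by provider and are not modelled here.

```python
def hdfs_raw_capacity_tb(logical_tb: float, replication_factor: int = 3) -> float:
    """Raw disk capacity HDFS needs in order to hold `logical_tb` of data.

    Every block is stored `replication_factor` times (the HDFS default is 3).
    """
    if logical_tb < 0 or replication_factor < 1:
        raise ValueError("logical_tb must be >= 0 and replication_factor >= 1")
    return logical_tb * replication_factor


# Hypothetical example: 100 TB of logical data.
logical_tb = 100.0
raw_tb = hdfs_raw_capacity_tb(logical_tb)
print(f"HDFS raw capacity: {raw_tb:.0f} TB for {logical_tb:.0f} TB of data")
```

Assuming a storage service bills by the logical size of the data, the capacity to pay for differs by a factor of three before any difference in unit price is considered.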
No stop-the-world garbage collection (GC) on NameNodes
HDFS is written in Java and suffers from stop-the-world garbage collection issues that cause the entire cluster to stop processing at unpredictable times. JuiceFS has no such limitations.
No continuous capacity management / tuning
HDFS usually requires long-term capacity planning and management, and has to keep scaling vertically or horizontally to meet changing storage needs.
JuiceFS is fully elastic: there is no need to plan ahead, and you pay as you go.
HDFS requires monitoring and maintenance to ensure high availability. JuiceFS is operated by a dedicated SA team and has a more efficient failover solution that provides higher availability than HDFS.
No expensive Professional Services contracts that price-scale to the size of your cluster
Because HDFS is complex, many companies buy expensive third-party professional services to keep it running stably.
JuiceFS is a fully managed service whose price already includes professional services, and the JuiceFS team is responsible for service reliability.
Replicate across all regions/zones or across heterogeneous cloud providers.
HDFS does not support remote data replication. Customers have to design and implement their own complex replication programs, and the results are very limited.
JuiceFS allows you to replicate data to any region of any cloud, so you can access the same data efficiently from both locations at the same time and seamlessly migrate computing tasks between two public clouds or two regions.
Mirror all or part of your data to any place around the world in near real-time.
JuiceFS also provides near real-time data mirroring to any public cloud and region around the world, with only seconds of data latency while ensuring data consistency.
JuiceFS clients installed on your VMs communicate with the object store directly. Your data never goes through our servers or third-party proxies, which keeps it private.
Data replication is done entirely through the client on your VM.
On the other hand, JuiceFS has the following advantages over directly using object storage for big data:
- Strong data consistency guarantee;
- Performance is more than ten times higher than object storage;
- Run ad-hoc queries directly from the Linux command line. With object storage, the data has to be downloaded first, which is tedious.
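Because a mounted file system exposes ordinary files, an ad-hoc query needs nothing beyond standard file APIs. The sketch below is a minimal grep-like search in Python. The function name `grep_tree`, the mount path, and the search pattern are all hypothetical and only serve as illustration.

```python
import os
import re
from typing import Iterator, Tuple


def grep_tree(root: str, pattern: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (path, line_number, line) for each line under `root` matching `pattern`."""
    regex = re.compile(pattern)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Plain open() is enough: no download step, no storage SDK.
            with open(path, "r", encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if regex.search(line):
                        yield path, number, line.rstrip("\n")


if __name__ == "__main__":
    # "/jfs/warehouse/events" is a hypothetical directory inside a mounted file system.
    for path, number, line in grep_tree("/jfs/warehouse/events", r"status=5\d\d"):
        print(f"{path}:{number}: {line}")
```

The same query against raw object storage would first need a listing and download step before any line could be inspected.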
Background & Challenges
Face recognition technology has been applied in many areas of daily life, and autonomous driving seems to be getting closer and closer. In the age of artificial intelligence, all of this "intelligence" comes from the ability to process and analyze massive amounts of information. That information takes many forms, including text, pictures, audio, video, medical images, and data collected by various sensors. The amount of data is growing exponentially, which presents new challenges for storage systems.
We interviewed many top teams in the field of artificial intelligence and found that data storage is a challenge they all share. In fields such as image recognition, speech synthesis, and autonomous driving, teams need to store and process billions or even tens of billions of data items, which is a huge challenge for existing storage systems.
JuiceFS is specifically optimized for machine learning scenarios and for storing billions of files, and it provides ample I/O capacity for model training. Through its POSIX interface, JuiceFS supports machine learning frameworks such as TensorFlow / MXNet / Caffe / PyTorch without any custom development.
In the machine learning scenario, JuiceFS has the following advantages:
- Optimized for billions of inodes, with memory efficiency 5 to 10 times that of existing open source storage solutions;
- A complete caching strategy to meet the I/O demands of machine learning workloads;
- Works directly with framework APIs, without any custom development;
- POSIX provides an intuitive way to manage data;
- Data sharing between multiple clusters.
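As an illustration of what "no custom development" means with a POSIX interface, the sketch below builds a list of training samples using only the Python standard library. It assumes a hypothetical layout in which each class has its own subdirectory under a mounted dataset root. The function name, path, and file suffix are assumptions, not part of any framework.

```python
from pathlib import Path
from typing import List, Tuple


def list_labeled_samples(root: str, suffix: str = ".jpg") -> List[Tuple[Path, str]]:
    """Return (file_path, label) pairs, using each subdirectory name as the label."""
    samples: List[Tuple[Path, str]] = []
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # skip stray files at the top level
        for file_path in sorted(class_dir.glob(f"*{suffix}")):
            samples.append((file_path, class_dir.name))
    return samples


if __name__ == "__main__":
    dataset_root = "/jfs/datasets/images"  # hypothetical mount path
    if Path(dataset_root).is_dir():
        samples = list_labeled_samples(dataset_root)
        print(f"{len(samples)} samples found")
```

A list like this can be handed to whatever data loader a framework provides, since the loader only sees ordinary file paths.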
Application Data Backup and Recovery
Background & Challenges
The data of core applications, such as MySQL / MongoDB, needs to be backed up frequently. Under a complete and rigorous backup strategy, multiple versions of full and incremental backups are kept over a long period of time, so the required space is usually more than ten times that of the application data itself.
The most popular solution is the cloud disk. Because a cloud disk can only be attached to a single machine, you need to maintain a large number of disks to back up multiple application instances, and you will run into the maximum capacity limit. You also need to do capacity planning and management up front. To use backup data, you have to find the corresponding disk and mount it on the host used to restore the application. This process is difficult to automate, so operational efficiency is hard to improve.
If you upload backups to object storage for archiving, capacity expands flexibly and the price is relatively low, but the backup-verify-recovery process takes a very long time. In the GitHub 2018 database splitting incident, approximately 8 hours were spent downloading backup data from object storage.
Mount JuiceFS on the application node used for backups (such as a replica node of the database), run the physical backup command, and write the data directly to the JuiceFS mount directory. JuiceFS automatically compresses data as it is written, which dramatically reduces the amount of data transferred and increases speed.
JuiceFS can also enable data encryption, encrypting data in parallel during the backup, so data privacy and security are ensured while backup efficiency stays high.
JuiceFS supports directory-based atomic snapshots. You can take a snapshot of the backup data and then start a MySQL instance from the snapshot to verify that the backup is correct. Data modified during verification is handled by copy-on-write and does not affect the original backup. Once verification is complete, the snapshot can simply be deleted.
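The "write the backup straight into the mount directory" step can be sketched with plain file operations. The example below copies a dump file into a backup directory and checks it by SHA-256. It does not use JuiceFS snapshots or any JuiceFS-specific command, and the function names and paths are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` and verify the copy by checksum."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # ordinary copy; the mount handles the storage side
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"checksum mismatch for {target}")
    return target


# Hypothetical usage, assuming a file system is mounted at /jfs:
# backup_file(Path("/var/backups/db-full.xb"), Path("/jfs/backups/mysql/2024-01-01"))
```

A checksum only confirms that the copy is intact. Verifying that the backup is actually restorable still requires starting a database instance from it, as described above.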
Log Data Collection and Archiving
Background & Challenges
Nowadays, logs are used for more than debugging system problems. User access logs are widely used in business intelligence. Log analysis and mining can uncover important value that would otherwise be overlooked, which helps improve user experience and increase business value. Collecting and archiving log data is the first step towards business intelligence.
Every service generates logs, so log generation is extremely fragmented, and a service is needed to aggregate the logs collected on each node and archive them together. There are many log collection systems in the open source community and on the commercial market, but deploying, maintaining, and troubleshooting them is complex and requires continuous manual operation and maintenance.
Meanwhile, archived logs take up more and more space over time. The storage has to meet the performance requirements of analysis and computation on the one hand, and be cost-effective on the other. Although object storage offers flexible expansion and a low price, it is inconvenient for analysis, query, and management, with poor performance and low efficiency.
JuiceFS can be shared across many machines and lets you organize archives by directory structure, which makes it well suited to log collection and archiving. Just mount JuiceFS on all the nodes that generate logs (containers, virtual machines, or physical machines), and use the system's built-in log rotation mechanism to automatically rename, package, compress, and copy logs to JuiceFS.
Using JuiceFS for log collection and archiving will bring you the following advantages:
- No need to maintain collection components and deploy a large number of agents;
- Fully compatible with POSIX, with no learning curve;
- Compatible with the computing frameworks in the Hadoop ecosystem, with performance ten times that of object storage;
- Easy to query, compatible with all Linux command line tools;
- Flexible and scalable, with unlimited capacity, so you can say goodbye to capacity planning;
- Trash bin support to prevent accidental deletion.
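The rename, compress, and copy sequence that a log rotation mechanism performs can be sketched in a few lines of Python. This is an illustration, not a replacement for the system's own rotation tool. The function name and the paths in the usage comment are hypothetical, and the writing service is assumed to reopen its log file after the rename.

```python
import gzip
import shutil
import time
from pathlib import Path


def rotate_and_archive(log_path: Path, archive_dir: Path) -> Path:
    """Rename the active log, gzip it into `archive_dir`, and remove the local copy."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    rotated = log_path.with_name(f"{log_path.name}.{stamp}")
    log_path.rename(rotated)  # assumes the service reopens its log afterwards

    archive_dir.mkdir(parents=True, exist_ok=True)
    archived = archive_dir / f"{rotated.name}.gz"
    # Compress while copying, so only the compressed bytes reach the archive.
    with rotated.open("rb") as src, gzip.open(archived, "wb") as dst:
        shutil.copyfileobj(src, dst)

    rotated.unlink()
    return archived


# Hypothetical usage, assuming a shared file system is mounted at /jfs:
# rotate_and_archive(Path("/var/log/app/access.log"), Path("/jfs/logs/web-01"))
```

Giving each node its own subdirectory under the shared mount keeps archives from different machines separate while leaving them all queryable from any node.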
Background & Challenges
NAS is the most common solution for enterprise data sharing, but maintaining a highly available NAS is very difficult. You also have to deal with the performance bottleneck of the NFS gateway, the limited number of machines that can access it, and insecure plaintext data transport. Enterprise data sharing needs the support of next-generation storage products.
JuiceFS is the ideal enterprise-class NAS replacement. It features high availability, flexible scaling, encrypted transport & storage, and support for thousands of machines mounting it simultaneously to share data. It is fully compatible with existing applications, which can be migrated to JuiceFS without modifying a single line of code.
Data sharing with JuiceFS has the following advantages:
- High availability based on the Raft protocol;
- The capacity is elastic and scalable, up to 10PB;
- Supports thousands of clients mounting at the same time and reading and writing concurrently;
- Millisecond latency, high throughput;
- Secure access using TLS encrypted transport;
- Support for snapshots, trash bins, and full Linux access control rules.
Off-site Data Protection
Background & Challenges
Data backup is critical to the business. Even Google's data centers have had accidents in which data was lost to a lightning strike. Many customers overlook off-site data protection. Important data needs to be backed up across different cities, or even different countries, to ensure the security and continuity of the business.
In the past, an off-site data protection plan usually involved building a data center at an off-site location. Even with the help of a public cloud platform, it was often necessary to set up computing resources that communicate with the main data center to complete the backup task. The investment in people and equipment was very large, which is why off-site data protection has been difficult to achieve.
JuiceFS provides fully automated data replication across regions and across public cloud platforms. Hybrid cloud customers can also easily back up data from IDC to the cloud via JuiceFS.
Using JuiceFS for off-site data protection has the following advantages:
- Fully automatic, with no manual effort required;
- Can be copied to any availability zone and public cloud platform;
- Dramatically lower TCO: the off-site location only consumes object storage, which saves a lot of CPU and memory resources;
- Sub-minute data replication across continents, for example between China and the US;
- Data blocks are encrypted for storage, so sensitive data can be backed up with confidence;
- JuiceFS is deployed across two availability zones by default, which is equivalent to same-city redundancy, so it is not affected by the failure of a single public cloud availability zone.