Solutions

Application Data Backup and Recovery

Background

Core application data, such as MySQL or MongoDB, needs to be backed up frequently. A complete and rigorous backup strategy retains multiple versions of full and incremental backups over a long period, so the required space is usually more than ten times the size of the application data itself.

The most popular solution is the cloud disk. Since a cloud disk can only be attached to a single machine, you need to maintain a large number of disks to back up multiple application instances, and you will run into per-disk capacity limits. You also need to do capacity planning and management up front. When you need the backup data, you must find the corresponding disk and mount it on the host used to restore the application. This process is difficult to automate, and operation and maintenance efficiency is hard to improve.

If you upload backups to object storage for archiving, capacity can be expanded flexibly and the price is relatively low, but the backup-verify-recovery process takes a very long time. In GitHub's 2018 database incident, approximately 8 hours were spent downloading backup data from object storage.

Solution

Mount JuiceFS on the application node used for backup (such as a database replica node), execute the physical backup command, and write the data directly to the JuiceFS mount directory. JuiceFS automatically compresses data as it is written, dramatically reducing the amount of data transferred and increasing speed.
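As a minimal sketch of this workflow: because JuiceFS is mounted as an ordinary directory, a backup script only needs plain POSIX file operations. The mount path, backup layout, and function name below are illustrative assumptions, not part of JuiceFS itself (any directory can stand in for the mount point):

```python
import os
import shutil
import time


def backup_to_jfs(source_dir: str, jfs_mount: str, kind: str = "full") -> str:
    """Copy a physical backup into a versioned directory under the mount.

    JuiceFS compresses data transparently while it is written, so no
    explicit compression step is needed here. `kind` distinguishes
    full and incremental backups in the directory name.
    """
    version = time.strftime("%Y%m%d-%H%M%S")
    target = os.path.join(jfs_mount, "backups", f"{kind}-{version}")
    # Ordinary POSIX writes; copytree also creates intermediate directories.
    shutil.copytree(source_dir, target)
    return target
```

In practice the copy step would be replaced by the database's own physical backup command (for example, pointing its target directory at the mount), with the same versioned-directory convention.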

In addition, JuiceFS can enable data encryption, encrypting data in parallel while backing up, ensuring data privacy and security while maintaining high backup efficiency.

JuiceFS supports directory-based atomic snapshots. You can take a snapshot of the backup data and use it to start a MySQL instance that verifies the correctness of the backup. Any modification during verification is handled by the copy-on-write mechanism and does not affect the original backup. Once verification is complete, the snapshot can be deleted directly.

Log Data Collection and Archiving

Background

Nowadays, logs are not only used for debugging system problems. User access logs are widely used in business intelligence: through log analysis and mining, many overlooked insights can be discovered that improve user experience and increase business value. Collecting and archiving log data is the first step toward business intelligence.

Every service generates logs, so log generation is extremely fragmented, and a service is needed to aggregate the logs collected on each node and archive them together. There are many log collection systems in both the open source community and the commercial field, but their deployment, maintenance, and troubleshooting are complex and require continuous human operation and maintenance.

At the same time, as time goes by, archived logs require a large amount of storage space. On the one hand, storage must meet the performance requirements of analysis and computation; on the other hand, the cost must remain economical. Object storage offers flexible expansion and low prices, but it is inconvenient for analysis, query, and management, with poor performance and low efficiency.

Solution

JuiceFS can be shared across machines and managed by directory structure, which makes it well suited for log collection and archiving. Simply mount JuiceFS on all nodes that generate logs (containers, virtual machines, or physical machines), and use the system's built-in log rotation mechanism to automatically rename, package, compress, and copy logs to JuiceFS.
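For example, on a node whose JuiceFS volume is mounted at /jfs (a hypothetical path, as are the log paths below), the standard logrotate tool can perform the whole rename-compress-copy cycle. The directives are ordinary logrotate options; copytruncate is included because olddir normally must be on the same filesystem as the log being rotated:

```
# /etc/logrotate.d/myapp — rotate local logs and archive them onto JuiceFS
/var/log/myapp/*.log {
    daily
    rotate 30
    dateext
    compress
    missingok
    notifempty
    copytruncate          # copy then truncate, so olddir may be on another filesystem
    olddir /jfs/logs/myapp
}
```

With this in place, each node's compressed, date-stamped logs accumulate under the shared mount with no dedicated collection agent.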

Using JuiceFS for log collection and archiving will bring you the following advantages:

  1. No need to maintain collection components or deploy a large number of agents;
  2. Fully compatible with POSIX, with no learning curve;
  3. Compatible with computing frameworks in the Hadoop ecosystem, with performance more than ten times that of object storage;
  4. Easy to query, compatible with all Linux command-line tools;
  5. Flexible and scalable, with unlimited capacity: say goodbye to capacity planning;
  6. A trash bin to protect against accidental deletion.

Data Sharing

Background

NAS is the most common solution for enterprise data sharing, but maintaining a highly available NAS is very difficult. You also need to address the performance bottleneck of the NFS gateway, the limited number of client machines, and insecure plaintext data transmission. Enterprise data sharing requires the support of next-generation storage products.

Solution

JuiceFS is an ideal enterprise-class NAS replacement. It features high availability, flexible scaling, encrypted transport and storage, and support for thousands of machines mounting simultaneously to share data. It is fully compatible with existing applications, which can migrate to JuiceFS without modifying a single line of code.

Data sharing with JuiceFS has the following advantages:

  1. High availability based on the Raft protocol;
  2. Elastic, scalable capacity of up to 10 PB;
  3. Thousands of clients can mount, read, and write at the same time;
  4. Millisecond latency and high throughput;
  5. Secure access over TLS-encrypted transport;
  6. Support for snapshots, the trash bin, and full Linux access control rules.

Off-site Data Protection

Background

Data backup is critical to the business. Even in Google's data centers, data has been lost to lightning strikes. Many customers overlook off-site data protection. Important data needs to be backed up across different cities, or even different countries, to ensure business security and continuity.

In the past, an off-site data protection plan usually involved building a data center in another location. Even with the help of a public cloud platform, it was often necessary to set up computing resources to communicate with the main data center to complete the backup tasks. The investment in human and material resources was very large, which is why off-site data protection has been difficult to achieve.

Solution

JuiceFS provides fully automated data replication across regions and across public cloud platforms. Hybrid cloud customers can also easily back up data from an IDC to the cloud via JuiceFS.

Using JuiceFS for off-site data protection has the following advantages:

  1. Fully automatic, with no manual effort required;
  2. Data can be replicated to any availability zone or public cloud platform;
  3. TCO drops dramatically: only object storage is consumed at the off-site location, saving a lot of CPU and memory resources;
  4. Sub-minute Sino-US cross-continent data replication;
  5. Data blocks are encrypted at rest, so sensitive data can be backed up with confidence;
  6. JuiceFS is deployed across two availability zones by default, which is equivalent to same-city disaster recovery and is unaffected by the failure of a single public cloud availability zone.

Big Data

Background

HDFS on the public cloud needs to be built on cloud disks, which cost more than three times as much as bare hard disks, and public clouds do not provide a fully managed HDFS service, so you have to operate and maintain it yourself.

If object storage is used for big data analytics, poor performance and the lack of consistency guarantees can lead to computational errors.

Solution

JuiceFS is a fully managed service with a 99.95% SLA that requires no customer operations. Capacity scales elastically without an upper limit, and the price saves more than 60% compared with HDFS built on cloud disks. JuiceFS is also fully compatible with Hadoop / Spark / Hive / HBase / Presto / Impala. It can be used as a supplement to HDFS, storing the cold data that takes up the most space, or it can completely replace HDFS to separate storage from computing and better utilize the elastic scalability of public cloud computing.
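As an illustration, the JuiceFS Hadoop Java SDK is registered in core-site.xml roughly as follows; the implementation class names come from the SDK, while credential and metadata settings differ between versions and are omitted here:

```xml
<!-- core-site.xml: register JuiceFS as a Hadoop-compatible filesystem -->
<property>
  <name>fs.jfs.impl</name>
  <value>com.juicefs.JuiceFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.jfs.impl</name>
  <value>com.juicefs.JuiceFS</value>
</property>
```

Once configured, Hadoop-ecosystem engines access data through `jfs://` paths just as they would with `hdfs://`, so existing jobs need only a path change.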

Using JuiceFS for big data storage has the following advantages:

  1. JuiceFS does not require any operation and maintenance, and comes with a 99.95% SLA;
  2. Capacity is flexible and scalable, suitable for mass data archiving without capacity planning;
  3. Strong data consistency guarantees;
  4. Performance more than ten times that of object storage;
  5. Compared with self-built HDFS on cloud disks, cost savings are over 60%;
  6. Ad hoc queries can be run directly from the Linux command line; data stored in object storage would have to be downloaded first, wasting a lot of time.

Artificial Intelligence

Background

Face recognition technology has been applied to a variety of everyday scenarios, and autonomous driving seems to be getting closer and closer. In today's era of artificial intelligence, all this "intelligence" comes from the ability to process and analyze massive amounts of information. That information takes thousands of forms, including text, pictures, audio, video, medical images, and data collected by various sensors. The amount of data is growing exponentially, presenting new challenges for our storage systems.

We interviewed many top teams in the field of artificial intelligence and noticed that data storage is a challenge everyone faces, especially in fields such as image recognition, speech synthesis, and autonomous driving, where billions or even tens of billions of files must be stored and processed. This is a huge challenge for existing storage systems.

Solution

JuiceFS is optimized for machine learning scenarios and for storing billions of files, providing ample I/O capability for model training. Through its POSIX interface, JuiceFS supports machine learning frameworks such as TensorFlow / MXNet / Caffe / PyTorch without any custom development.
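To make the "no custom development" point concrete, here is a framework-neutral sketch of a map-style dataset that reads training samples from a mount directory with plain POSIX I/O. The class name, directory layout, and file suffix are illustrative assumptions; any framework's dataset class (for example PyTorch's) could wrap it directly:

```python
import os


class MountedFileDataset:
    """Map-style dataset over files under a (JuiceFS) mount directory.

    Because the mount is POSIX-compatible, training frameworks can read
    samples with ordinary file I/O; no SDK or custom development is needed.
    """

    def __init__(self, root: str, suffix: str = ".jpg"):
        # Index every matching file once; sorting keeps ordering deterministic.
        self.paths = sorted(
            os.path.join(dirpath, name)
            for dirpath, _, names in os.walk(root)
            for name in names
            if name.endswith(suffix)
        )

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> bytes:
        with open(self.paths[idx], "rb") as f:  # ordinary POSIX read
            return f.read()
```

The same pattern scales to billions of files because listing and random reads go through the filesystem, with JuiceFS's caching absorbing the repeated-epoch I/O load.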

In the machine learning scenario, JuiceFS has the following advantages:

  1. Optimized for billions of inodes; memory efficiency is 5 to 10 times that of existing open source storage solutions;
  2. A complete caching strategy that meets the I/O load requirements of machine learning scenarios;
  3. Direct compatibility with framework APIs, without any custom development;
  4. POSIX provides an intuitive way to manage data;
  5. Data sharing between multiple clusters.