From Hadoop to cloud native — Separation of compute and storage for big data platform

2022-10-11
Rui Su

About Author

Rui Su, a partner and the first employee of Juicedata, has been actively participating in the JuiceFS open source community. Rui previously worked as the founder & CEO of Kung Fu Car Wash, an Internet O2O car service company, and as PM & Tech Lead of Douban Movies (a Chinese counterpart of Rotten Tomatoes or IMDb). Through this work, Rui has experienced the architectural transition from big data platforms dominated by the Hadoop technology stack to the cloud-native era of separated compute and storage.

A short review of the Hadoop architecture

Hadoop, released in 2006, was designed as an all-in-one framework for storing and computing over massive amounts of data. Hadoop's core consists of three components: MapReduce (for computation), YARN (for resource scheduling, introduced later in Hadoop 2), and HDFS (for storage).


The computation layer has developed the most rapidly of the three. In the beginning there was only MapReduce, but soon after, various derivatives appeared: computation frameworks (such as Tez and Spark), data warehouses (e.g., Hive), and interactive SQL query engines (such as Presto and Impala). There are also abundant data transfer components (e.g., Sqoop) that work alongside them.


The underlying storage layer has been evolving for well over a decade, and HDFS has been far and away the dominant choice. The flip side of HDFS's success is that it became the default for nearly every computing component. The components above, all grown out of the big data ecosystem, were designed against the HDFS API, and some even rely directly on specific HDFS capabilities, such as the low-latency WAL writes used by HBase and the data locality features used by MapReduce and Spark.

These big data components designed for the HDFS API bring potential challenges to the subsequent deployment of the data platform to the cloud.

The following diagram presents part of a simplified HDFS architecture that couples computation with storage. There are three nodes in the diagram, and each node acts as an HDFS DataNode storing data while simultaneously running a NodeManager process for YARN. The NodeManager lets YARN treat the node as part of its compute pool, so computing tasks can be scheduled onto the very DataNode that holds the data. With computation and data on the same machine, data can be read directly from the local disk during computation.
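To make the data-locality idea concrete, here is a toy Python sketch of a locality-aware placement decision. All names and structures here are invented for illustration; real YARN scheduling is far more elaborate.

```python
# Toy sketch of locality-aware scheduling. block_locations maps an HDFS
# block to the DataNodes holding its replicas (illustrative only).
def schedule_task(block_locations, block_id, free_nodes):
    """Prefer a free node that already stores the block (node-local read);
    otherwise fall back to any free node (remote read over the network)."""
    local = [n for n in block_locations.get(block_id, []) if n in free_nodes]
    if local:
        return local[0], "node-local"
    return next(iter(free_nodes)), "remote"

block_locations = {"blk_001": ["node1", "node3"], "blk_002": ["node2"]}
print(schedule_task(block_locations, "blk_001", {"node2", "node3"}))  # ('node3', 'node-local')
print(schedule_task(block_locations, "blk_002", {"node1"}))           # ('node1', 'remote')
```

When a replica of the block sits on a free node, the task lands there and reads from local disk; only otherwise does the data cross the network.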


Why does Hadoop couple compute with storage?

An important reason lies in the limits of network and hardware at the time. In 2006, cloud computing barely existed; Amazon had just released its first service. The mainstream NIC was only 100 Mbps, which was the main constraint, while a single disk used for big data delivered about 50 MB/s, equivalent to 400 Mbps of network bandwidth. With eight disks in one node, shipping their data at full throughput would require over 3 Gbps, far exceeding the upper limit of a 1 Gbps uplink, so the node's disks could never be fully utilized over the network. In other words, if the computing task sat on one side of the network and the data on a DataNode on the other, network bandwidth was clearly the major bottleneck.

The demand for the separated storage and compute architecture

From 2006 to around 2016, data volumes grew rapidly, but the demand for computing power, which is driven by how fast humans can put data to use, did not grow nearly as fast. Moreover, data is often not used immediately after it is generated, so enterprises try to store as much of it as possible first and mine its value later.

In this scenario, the coupled architecture makes capacity expansion wasteful. When storage runs short, we must add not just disks but whole machines, upgrading CPU and memory as well, because in the coupled architecture every DataNode also serves as a compute node. Machines are therefore typically bought with a balanced compute-to-storage configuration that provides sufficient storage capacity and comparable computing power at the same time. In practice, however, computing demand does not grow as fast as data volume, so expanding compute this way wastes resources. Utilization of storage and I/O across the cluster can also be very unbalanced, and the larger the cluster, the worse the imbalance. Besides, machines with a well-balanced compute and storage configuration are not easy to buy.

Lastly, data locality scheduling may fail to help in practice because data distribution can be uneven: some nodes become local hotspots and need far more computing power than they have. Even though the big data platform can schedule tasks onto those hotspot nodes, their I/O then becomes the bottleneck.

The development of hardware has facilitated the separation of storage and compute. First, 10 Gbps NICs have become commonplace, and 20, 40, or even 50 Gbps NICs are increasingly common in data centers and in the cloud; some AI scenarios even use 100 Gbps NICs. Network bandwidth has thus grown by roughly 100 times.


With respect to storage, many enterprises still build large clusters on spinning disks, whose throughput has roughly doubled from 50 MB/s to 100 MB/s. An instance with a 10 Gbps NIC can carry the peak throughput of about 12 such disks, which is sufficient for most enterprises, so network transmission is no longer the bottleneck.
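The two eras' numbers can be checked with a quick back-of-the-envelope calculation (a sketch; the helper function name is my own):

```python
# Back-of-the-envelope check of the NIC-vs-disk bottleneck figures above.
def disks_per_nic(nic_mbps, disk_mb_per_s):
    """How many disks running at full throughput one NIC can carry.
    1 MB/s of disk throughput needs 8 Mbps of network bandwidth."""
    return (nic_mbps / 8) / disk_mb_per_s

# 2006: a 100 Mbps NIC vs 50 MB/s disks -- the NIC cannot even feed one disk.
print(disks_per_nic(100, 50))       # 0.25
# Today: a 10 Gbps (10,000 Mbps) NIC vs 100 MB/s disks -- about 12 disks.
print(disks_per_nic(10_000, 100))   # 12.5
```

The hundredfold jump in network bandwidth, against merely doubled disk throughput, is what moves the bottleneck away from the network.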

NICs and disks have changed over time, and so has the software. In the early days, data was often stored as CSV files or zip archives; now there are more efficient compression algorithms such as Snappy, LZ4, and Zstandard, along with better storage formats: the row-oriented Avro and the columnar Parquet and ORC.

Together, these developments have further reduced the amount of data that must be transmitted. With NICs improving far faster than disk throughput, the I/O bottleneck has gradually diminished, making the separation of storage and compute feasible.

How to separate storage and compute?

The first try: Deploy HDFS independently on the cloud

There have been attempts to separate storage and compute since 2013. The initial solution was relatively simple: deploy HDFS independently instead of co-locating it with the compute workers. Since this solution stays within the Hadoop ecosystem, no new components are needed.

The following diagram of the disaggregated architecture shows that the NodeManager is no longer deployed on the DataNodes, so computing tasks are no longer sent to them. Storage becomes an independent cluster, and the data needed for computation is transmitted over the network, supported by end-to-end 10 Gbps networking. (Note that the network transmission path is not marked in the diagram below.)

Although this solution abandons data locality, the most ingenious design in Hadoop, the increase in network speed makes cluster configuration remarkably simpler. This was verified in experiments that the Juicedata co-founder, Davies, carried out with his teammates at Facebook in 2013, and the results corroborated the feasibility of deploying and managing compute nodes independently.


However, this solution was not developed further. Why? The main reason is its shortcomings when combined with cloud resources, shortcomings that do not show up in on-premises data centers.

The first reason is that the HDFS multi-replica mechanism raises costs on the cloud. On premises, enterprises built HDFS on bare disks, and HDFS keeps multiple replicas to guard against disk damage and to keep data available and reliable when a DataNode goes down unexpectedly. On the cloud, however, cloud disks are themselves already replicated internally; building a three-replica HDFS on top of them multiplies the number of copies actually stored, significantly increasing cost for the enterprise.

Secondly, this solution cannot exploit the unique strengths of the cloud: out-of-the-box service, elastic scaling, and pay-as-you-go billing. Deploying HDFS on the cloud still requires users to create machines, deploy and maintain them manually, and monitor and operate everything themselves, with no easy way to scale capacity up or down. In this form, the separation of storage and compute is still not easy.

The third reason is the limitation of HDFS itself. The NameNode can only scale vertically, not horizontally, which caps the number of files a single HDFS cluster can manage. When a heavily loaded NameNode consumes too many resources, a full GC (garbage collection) pause may be triggered, affecting the availability of the entire cluster: storage may appear down and unreadable, there is no way to intervene in the GC process, and no way to predict how long the pause will last. This is a long-standing pain point of high-load HDFS clusters.

In our experience, HDFS is generally easy to operate and maintain with fewer than 300 million files; beyond that, operational complexity rises sharply. The practical limit for a standalone cluster is around 500 million files. Going beyond that requires introducing the HDFS Federation mechanism, which in turn increases O&M and management costs.

Public cloud + Object storage

With the development of cloud computing technologies, enterprises gained one more option for storage: object storage. It is well suited to large-scale unstructured data and was originally designed for uploading and downloading objects. Backed by the massive, elastic infrastructure of cloud vendors' storage systems, object storage is also a low-cost option.

AWS promoted this scheme first, and the other cloud vendors followed, all encouraging users to replace HDFS with object storage. Two of the most obvious benefits of object storage that HDFS cannot match are:

  1. Service-oriented and out-of-the-box, i.e., no need for deployment, monitoring, or O&M.
  2. Elastic scaling and pay-as-you-go, i.e., no need to plan capacity ahead.

Although this solution greatly simplifies O&M compared to deploying HDFS independently on the cloud, the following problems arise when object storage is used to support complex data systems like Hadoop.

  1. The poor performance of file listing

Listing is one of the most basic operations in a file system. Listing a directory is very fast in HDFS and other file systems, and this good performance comes from the tree structure the file system maintains.

In contrast, data in object storage is arranged in a flat structure. To store tens of thousands or even hundreds of millions of objects, object storage builds an index on the key, the object's unique identifier, which can be thought of as its file name. With this structure, Listing becomes a lookup over this flat index, and its performance is much weaker than walking a tree.
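The contrast can be shown with two toy data structures (purely illustrative; real file systems and object stores are vastly more sophisticated):

```python
# A file system keeps explicit directory nodes: listing reads one node.
tree = {"/order": ["2022"], "/order/2022": ["08"], "/order/2022/08": ["10", "11"]}

def list_tree(path):
    return tree[path]  # O(number of children)

# An object store keeps only a flat key index: a "directory" exists only
# as a shared key prefix, so listing must scan/range-query the index.
flat_keys = ["order/2022/08/10/detail", "order/2022/08/11/detail", "user/2022/08/10/log"]

def list_flat(prefix):
    return sorted({k[len(prefix):].split("/")[0] for k in flat_keys if k.startswith(prefix)})

print(list_tree("/order/2022/08"))   # ['10', '11']
print(list_flat("order/2022/08/"))   # ['10', '11']
```

Both return the same answer, but the flat version does work proportional to the matching key range rather than reading a single directory node.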


File system structure: a tree, suited to organizing data by directory

Object storage structure: flat, suited to storing data and accessing it directly

  2. No atomic rename in object storage, affecting stability and performance

In the ETL computing model, each subtask writes its results to a temporary directory on completion. When the entire task finishes, the temporary directory is renamed to the final directory name.

Such rename operations are atomic, fast, and transactionally guaranteed in HDFS and other file systems. However, since object stores have no native directory structure, rename must be simulated, involving a large number of data copies inside the system; this is time-consuming and carries no transactional guarantee.

In object storage, the file system path serves as the object's key, such as '/order/2022/08/10/detail'. Renaming a directory means searching for all keys carrying the old directory prefix and copying every matching object to a new key under the new name. Data is physically copied, which makes rename one to two orders of magnitude slower than in a file system; and since there is no transactional guarantee, a failure midway can leave incorrect data behind. Thus even this seemingly small difference between object storage and a file system can affect the performance and stability of an entire task pipeline.
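A minimal sketch of this emulated rename, using a plain dict to stand in for the object store (function name and layout are my own):

```python
# Why rename is expensive on an object store: with no real directories,
# renaming "a directory" means copying every object under the prefix and
# deleting the originals -- one by one, with no transaction.
def rename_prefix(store, old_prefix, new_prefix):
    for key in [k for k in store if k.startswith(old_prefix)]:
        store[new_prefix + key[len(old_prefix):]] = store[key]  # server-side copy
        del store[key]                                          # then delete
        # A crash here leaves the "directory" half renamed.

store = {"tmp/part-0": b"a", "tmp/part-1": b"b"}
rename_prefix(store, "tmp/", "order/2022/08/10/")
print(sorted(store))  # ['order/2022/08/10/part-0', 'order/2022/08/10/part-1']
```

Every object's data moves through a copy, whereas a file system rename merely updates one directory entry.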

  3. Eventual consistency in object storage, reducing computing stability and correctness

For example, when multiple clients create files concurrently under the same path, the result returned by the List API may not yet include all the created files; the internal system of the object storage needs time to synchronize for consistency. ETL data processing relies heavily on such access patterns, so eventual consistency can affect the correctness of data and the stability of tasks.

Object storage cannot guarantee strong consistency. To mitigate this problem, AWS released EMRFS. Its principle is to keep a companion DynamoDB table: knowing that List results may be incomplete, Spark records the files it writes in DynamoDB as well, then repeatedly calls the object storage List API and compares the result with the records in the database until the two match. However, the stability of this mechanism is not great, as it is affected by the load of the region where the object storage resides, so it is not an ideal route to strong consistency.
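A rough sketch of that "re-List until the store catches up with the manifest" idea. The function names, retry policy, and the simulated store are all invented for illustration; they are not the actual EMRFS implementation.

```python
import time

def wait_until_consistent(list_api, expected, retries=5, delay=0.01):
    """Re-call List until every file recorded in the manifest is visible."""
    for _ in range(retries):
        if set(list_api()) >= set(expected):   # everything recorded is visible
            return True
        time.sleep(delay)                      # back off, then List again
    return False

# Simulated eventually-consistent store: reveals all files only after a
# couple of List calls.
calls = {"n": 0}
def flaky_list():
    calls["n"] += 1
    return ["part-0", "part-1"] if calls["n"] >= 3 else ["part-0"]

print(wait_until_consistent(flaky_list, ["part-0", "part-1"]))  # True
```

The weakness the text points out is visible here: if the store stays behind longer than the retry budget, the job must either fail or proceed with an incomplete listing.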

In addition to the problems above, which stem from the differences between a file system and object storage, there is another issue with running Hadoop on object storage: poor compatibility between object storage and the Hadoop components. As mentioned at the beginning of the article, HDFS was almost the only storage option in the early days of the Hadoop ecosystem, and the upper-layer components were all developed against the HDFS API. Object storage changes both the storage structure and the API.

To adapt the existing Hadoop components, cloud vendors must, on one hand, modify the connectors between the components and their object storage, and on the other, patch the upper-layer components and verify each one's compatibility, which means a huge workload. As a result, the computing components offered on cloud vendors' big data platforms are very limited, generally only a few versions of the three most common ones: Spark, Hive, and Presto. This poses challenges for users who want to migrate their big data platforms to the cloud, or who need to use their own distributions and components.

How can we take full advantage of object storage without sacrificing the correctness of a file system?

Object Storage + JuiceFS

Clearly, object storage alone cannot meet the needs of complex data computation and analysis, and that is exactly what JuiceFS is designed for: to compensate for the shortcomings of object storage and provide solid support for computing, analytics, and AI training on top of low-cost object storage.

How does JuiceFS work with object storage? Take the deployment of JuiceFS in a Hadoop cluster as an example. Every worker node managed by YARN carries the JuiceFS Hadoop SDK, which guarantees full compatibility with HDFS. As shown at the bottom of the graph below, the SDK accesses two parts: the JuiceFS Metadata Engine and an S3 bucket. The Metadata Engine is the counterpart of the HDFS NameNode: it stores all the metadata of the file system, including directory trees, file names, permissions, and timestamps, and it resolves the NameNode scalability and GC problems described earlier.


The data itself is stored in the S3 bucket, the counterpart of the HDFS DataNodes: it can be treated as a vast pool of disks, handling data storage and replication. JuiceFS thus consists of three components: the JuiceFS Hadoop SDK, the Metadata Engine, and the S3 bucket.

What are other advantages of JuiceFS over using object storage directly?

  1. Complete compatibility with HDFS

This is because JuiceFS is designed to be fully POSIX-compatible, and its POSIX API coverage is broader and more demanding than the HDFS API.

Meanwhile, JuiceFS can be used alongside HDFS rather than replacing it outright. This benefit comes from the design of Hadoop itself: multiple file systems can be configured in one cluster, so JuiceFS and HDFS can coexist. Users therefore do not have to replace an existing HDFS cluster wholesale, which would carry considerable workload and risk; instead, they can migrate in batches, service by service, according to cluster conditions.
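A minimal sketch of what such a mixed configuration might look like in core-site.xml. The property names follow the JuiceFS Hadoop SDK documentation, but treat them, and the Redis address, as assumptions to verify against the SDK version you deploy.

```xml
<!-- Sketch: register JuiceFS under the jfs:// scheme alongside hdfs://.
     Property names follow the JuiceFS Hadoop SDK docs; verify against
     the SDK version you actually deploy. -->
<configuration>
  <property>
    <name>fs.jfs.impl</name>
    <value>io.juicefs.JuiceFileSystem</value>
  </property>
  <property>
    <name>juicefs.meta</name>
    <!-- metadata engine address (example: a Redis instance) -->
    <value>redis://redis-host:6379/1</value>
  </property>
</configuration>
```

With both schemes registered, the same job can read from hdfs:// paths and write to jfs:// paths, which is what makes batch-by-batch migration possible.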

  2. Powerful metadata performance

JuiceFS separates the metadata engine from the data, which greatly improves metadata performance. It also reduces its calls to the underlying object storage to the three most basic operations: GET, PUT, and DELETE. This architecture shields users from the poor metadata performance of object storage and also sidesteps the eventual consistency issues.

  3. Atomic rename

JuiceFS uses a dedicated metadata engine, which supports atomic file rename.
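A toy sketch of why a separate metadata engine makes rename atomic. The data layout here (a path-to-keys map) is invented for illustration, not JuiceFS's actual on-disk format.

```python
# File data stays under immutable, opaque object keys; rename only swaps
# an entry in the metadata map -- one transactional step, no data copied.
metadata = {"/tmp/out": ["chunk-9f2", "chunk-a41"]}   # path -> object keys
object_store = {"chunk-9f2": b"...", "chunk-a41": b"..."}

def rename(meta, old_path, new_path):
    meta[new_path] = meta.pop(old_path)   # single metadata update; objects untouched

rename(metadata, "/tmp/out", "/order/out")
print(metadata)  # {'/order/out': ['chunk-9f2', 'chunk-a41']}
```

Contrast this with the prefix-copy rename that raw object storage forces: here the object store is never touched, so the operation is fast and all-or-nothing.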

  4. Caching, which effectively improves access performance for hot data and provides the data locality feature

With caching, hot data no longer has to be fetched from object storage over the network on every access. Moreover, JuiceFS implements the HDFS-specific data locality API, so all upper-layer components that support data locality regain their awareness of data affinity. YARN can then prioritize scheduling tasks onto nodes that already hold a cache, yielding overall performance comparable to tightly coupled HDFS.

  5. POSIX compatibility, making it easy to integrate with machine learning and AI-related applications

As enterprise requirements and underlying technologies have evolved, the storage-and-compute architecture has shifted from the original coupled design to separated storage and compute. There are now several ways to achieve the separation, each with its own advantages and disadvantages: deploying HDFS directly on the cloud, using cloud vendors' Hadoop-compatible object storage solutions, or the public cloud + JuiceFS approach, which suits complex big data computing and storage on the cloud. For enterprises, there is no silver bullet; the key is to choose an architecture based on your own needs.

But no matter what you choose, keep it simple.