Compatibility Battle of Shared File Systems on the Cloud

2019-12-24
Rogerz, Davies

Disclaimer: This article was translated with DeepL; the original post is in Chinese.

"Everything is a file" is a basic design philosophy of UNIX. Files are organized hierarchically into a directory tree, which constitutes the basic form of a file system. When users store data in a file system, they do not have to care how the data is stored; they simply access it through the agreed interface specification.

Concepts

The most widely used interface specification for file systems is POSIX, a standard published by the IEEE, which includes sections on file and directory operations. The standard itself is rather lengthy and obscure, so we will not go into it here. For a more comprehensive overview, see the Quora question “What does POSIX conformance/compliance mean in the distributed systems world?”

POSIX compliance requires the following features of a file system:

  • Hierarchical directory structure, supporting arbitrary depth
  • Files are created via open(O_CREAT), directories are created via mkdir, etc.
  • Directories can be traversed by opendir/readdir
  • Paths/namespaces can be modified by rename, link / unlink, symlink / readlink, etc.
  • Data can be written via write or writev, persisted via fsync, and read via read or readv
  • Other interfaces such as stat, chmod / chown, etc.
  • Contrary to some popular claims, extended attributes do not appear to be part of POSIX, see The Open Group Base Specifications Issue 7, 2018 edition for a list of functions
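As a rough illustration of the interfaces listed above, the following sketch exercises most of them through Python's os module, which wraps the underlying system calls. The function name exercise_posix_calls and the file names are made up for this example; it is a quick smoke test, not a substitute for a real compliance suite.

```python
import os
import tempfile

def exercise_posix_calls(root):
    """Run a handful of POSIX file operations under `root` and report what was observed."""
    d = os.path.join(root, "dir")
    os.mkdir(d)                                          # mkdir
    path = os.path.join(d, "a.txt")
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)  # open(O_CREAT)
    os.write(fd, b"hello")                               # write
    os.fsync(fd)                                         # fsync
    os.close(fd)

    renamed = os.path.join(d, "b.txt")
    os.rename(path, renamed)                             # rename
    os.link(renamed, os.path.join(d, "hard"))            # link (hard link)
    soft = os.path.join(d, "soft")
    os.symlink("b.txt", soft)                            # symlink
    os.chmod(renamed, 0o600)                             # chmod

    fd = os.open(renamed, os.O_RDONLY)
    data = os.read(fd, 16)                               # read
    os.close(fd)
    st = os.stat(renamed)                                # stat
    return {
        "entries": sorted(os.listdir(d)),                # opendir/readdir
        "data": data,
        "nlink": st.st_nlink,
        "mode": st.st_mode & 0o777,
        "link_target": os.readlink(soft),                # readlink
    }

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        print(exercise_posix_calls(root))
```

Pointing root at a directory on a mounted shared file system gives a first impression of which calls behave as expected there.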

Testing

Whether a file system is actually POSIX-compatible can be verified with testing tools. One of the more popular test suites is pjdfstest, which originated in FreeBSD but also works on Linux and other systems. The pjdfstest test cases need to run as root and require Perl and TAP::Harness (a Perl package) to be installed. The test process is as follows:

cd /path/to/filesystem/under/test
sudo prove --recurse --verbose /path/to/pjdfstest/tests

We selected several shared file systems available in cloud environments for testing. The number of failed test cases for each file system is as follows:

Because the number of failed test cases for Amazon EFS is orders of magnitude larger than for the other products, the horizontal axis of the graph above uses a logarithmic scale for comparison purposes.

We also tested S3FS and Goofys, and each failed hundreds or even thousands of cases. The underlying reason is that neither project was designed strictly as a file system:

  • Goofys can mount S3 as a file system, but it is only a "Filey" system with a "POSIX-ish" interface (both descriptions come from the official project description). Goofys was designed for performance at the expense of POSIX compatibility, and the file operations it supports are heavily limited by the underlying object storage, such as S3. The test results confirm this. We recommend fully reviewing the application's data access patterns before production use to avoid falling into this trap.
  • S3FS, despite its name, is actually closer to a way of managing objects in an S3 bucket through a file system view. Although S3FS supports a larger subset of POSIX, it only maps system calls one-to-one onto object storage requests and does not provide the semantics and consistency of a regular file system (e.g. directory renames are not atomic, there is no mutual exclusion when a file is opened in exclusive mode, appending to a file rewrites the entire file, and hard links are not supported). These shortcomings mean that S3FS cannot replace a regular file system (even leaving performance aside), because applications expect POSIX-compliant behavior when they access a file system, and S3FS falls far short of that.

Analysis

Here we categorize the failed test cases and pick a few representative ones to analyze what limitations they may impose on applications.

In general, JuiceFS has the fewest failed test cases and the best compatibility, in terms of both the number and the categories of failures. Amazon EFS has far more failed test cases than any other file system, again in both total number and categories, so it cannot reasonably be compared in the same chart and is analyzed separately later.
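The categorization described above can be sketched as a small tally over the verbose output of prove. The sketch assumes the log layout shown in the excerpts in this article (a header line per test file such as /root/pjdfstest/tests/utimensat/08.t followed by "not ok" lines); the function name tally_failures and the regular expression are illustrative and may need adjusting for other TAP output.

```python
import re
from collections import Counter

# Header line for each test file, e.g.
# "/root/pjdfstest/tests/utimensat/08.t ........"
# The captured group is the test category (the directory name).
HEADER = re.compile(r"^\S*/tests/([^/\s]+)/\d+\.t\s")

def tally_failures(log_text):
    """Count 'not ok' lines per test category in a prove --verbose log."""
    counts = Counter()
    category = None
    for line in log_text.splitlines():
        match = HEADER.match(line)
        if match:
            category = match.group(1)
        elif line.startswith("not ok") and category is not None:
            counts[category] += 1
    return counts

SAMPLE_LOG = """\
/root/pjdfstest/tests/utimensat/08.t ........
not ok 5 - tried 'lstat f atime_ns', expected 100000000, got 0
not ok 6 - tried 'lstat f mtime_ns', expected 200000000, got 0
Failed 2/9 subtests
/root/pjdfstest/tests/utimensat/09.t ........
not ok 5 - tried 'lstat f mtime', expected 4294967296, got 0
"""

if __name__ == "__main__":
    print(dict(tally_failures(SAMPLE_LOG)))
```

The sample log is abbreviated from the JuiceFS excerpt in this article, with the generated file names shortened to f.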

JuiceFS

JuiceFS passed nearly all of the 8811 test cases in this run and failed only 3, all in the utimensat test set. The corresponding logs are as follows:

...
/root/pjdfstest/tests/utimensat/08.t ........
not ok 5 - tried 'lstat pjdfstest_bfaee1fc7f2c1f80768e30f203f41627 atime_ns', expected 100000000, got 0
not ok 6 - tried 'lstat pjdfstest_bfaee1fc7f2c1f80768e30f203f41627 mtime_ns', expected 200000000, got 0
Failed 2/9 subtests
/root/pjdfstest/tests/utimensat/09.t ........
not ok 5 - tried 'lstat pjdfstest_7911595d91adcf915009f551ac48e1f2 mtime', expected 4294967296, got 0

These failures come from utimensat/08.t and utimensat/09.t: 08.t tests sub-second precision of file access and modification times, and 09.t tests support for 64-bit timestamps.

JuiceFS currently stores timestamps with one-second precision as 32-bit integers, so it cannot pass these three tests (in fact, none of the file systems in this comparison passes this test set completely). If your application requires sub-second time precision or a larger time range, please contact us to discuss a solution.
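To make the limits concrete, here is a small sketch of what a seconds-only, unsigned 32-bit timestamp can represent. The helper store_timestamp is hypothetical and models the limitation only; it is not JuiceFS's actual implementation. The unsigned interpretation is an assumption that matches the 1970 to 2106 range mentioned in the summary.

```python
from datetime import datetime, timezone

UINT32_MAX = 2**32 - 1

def store_timestamp(seconds, nanoseconds):
    """Hypothetical model: keep the low 32 bits of the seconds, drop the nanoseconds."""
    return int(seconds) & UINT32_MAX, 0

# Latest instant representable in an unsigned 32-bit seconds counter.
latest = datetime.fromtimestamp(UINT32_MAX, tz=timezone.utc)
print(latest.isoformat())                 # 2106-02-07T06:28:15+00:00

# The mtime used in utimensat/09.t is 2**32, one past the maximum; in this
# model it wraps to 0, which is consistent with "got 0" in the log above.
print(store_timestamp(4294967296, 0))     # (0, 0)

# Any sub-second part is lost, as in the utimensat/08.t failures.
print(store_timestamp(1, 100000000))      # (1, 0)
```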

GCP Filestore

In addition to several failures on the utimensat test set, as with JuiceFS, GCP Filestore also fails 1 item in the unlink test set. This item also fails on all the other file systems tested apart from JuiceFS.

/root/pjdfstest/tests/unlink/14.t ...........
not ok 4 - tried 'open pjdfstest_b03f52249a0c653a3f382dfe1237caa1 O_RDONLY : unlink pjdfstest_b03f52249a0c653a3f382dfe1237caa1 : fstat 0 nlink', expected 0, got 1

This test (unlink/14.t) verifies the behavior of a file that is deleted while it is still open.

desc="An open file will not be immediately freed by unlink"

Deleting a file corresponds to unlink at the system-call level, i.e. removing the link from the file name to the corresponding inode, which decrements the inode's nlink by 1. This test case verifies exactly that.

# A deleted file's link count should be 0
expect 0 open ${n0} O_RDONLY : unlink ${n0} : fstat 0 nlink

The contents of a file are only really deleted if the link count (nlink) is reduced to 0 and there are no open file descriptors (fd) pointing to the file. If nlink is not updated correctly, it may result in files that should be deleted remaining on the system.
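The behavior checked by unlink/14.t can be reproduced with a few system calls. The sketch below opens a file, unlinks it while the descriptor is still open, and then reads nlink and the contents through the descriptor. The function name nlink_after_unlink is made up for this example; on a typical local POSIX file system it is expected to report a link count of 0 while the data stays readable.

```python
import os
import tempfile

def nlink_after_unlink(directory):
    """Unlink an open file and return (nlink, data) as seen through the open descriptor."""
    path = os.path.join(directory, "victim")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        os.write(fd, b"still here")
        os.unlink(path)                    # remove the only name of the file
        nlink = os.fstat(fd).st_nlink      # link count seen via the open fd
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 100)            # contents are still accessible
        return nlink, data
    finally:
        os.close(fd)                       # the last reference goes away here

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(nlink_after_unlink(d))
```

Running the same function against a directory on a shared file system shows whether it reports 0 or, as in the log above, 1.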

Tencent Cloud CFS

Compared to GCP Filestore, Tencent Cloud CFS additionally fails several open and symlink tests.

open failure example

A selection of the failure logs is as follows:

/root/pjdfstest/tests/open/07.t .............
not ok 5 - tried '-u 65534 -g 65534 open pjdfstest_f24a42815d59c16a4bde54e6559d0390 O_RDONLY,O_TRUNC', expected EACCES, got 0
not ok 7 - tried '-u 65533 -g 65534 open pjdfstest_f24a42815d59c16a4bde54e6559d0390 O_RDONLY,O_TRUNC', expected EACCES, got 0
not ok 9 - tried '-u 65533 -g 65533 open pjdfstest_f24a42815d59c16a4bde54e6559d0390 O_RDONLY,O_TRUNC', expected EACCES, got 0
Failed 3/23 subtests

This test (open/07.t) verifies that open with O_TRUNC returns an EACCES error when write permission is denied.

desc="open returns EACCES when O_TRUNC is specified and write permission is denied"

These three failures need to be analyzed together with the test code; they correspond to the owner, group and other cases respectively. Without loss of generality, we analyze only the owner case:

expect 0 -u 65534 -g 65534 chmod ${n1} 0477
expect EACCES -u 65534 -g 65534 open ${n1} O_RDONLY,O_TRUNC

The test first sets the owner permission to 4 (r--, read-only) and then tries to open the file with O_RDONLY,O_TRUNC, which is expected to return EACCES but actually returns 0.

According to the description of O_TRUNC in The Single UNIX ® Specification, Version 2:

O_TRUNC If the file exists and is a regular file, and the file is successfully opened O_RDWR or O_WRONLY, its length is truncated to 0. It will have no effect on FIFO special files or terminal device files. The result of using O_TRUNC with O_RDONLY is undefined.

In other words, the result of combining O_TRUNC with O_RDONLY is undefined, and the file under test in this case is itself empty, so O_TRUNC has no effect here.

symlink failure example

The corresponding test log is as follows:

/root/pjdfstest/tests/symlink/03.t ..........
not ok 1 - tried 'symlink 7ea12171c487d234bef89d9d77ac8dc2929ea8ce264150140f02a77fc6dcad7c3b2b36b5ed19666f8b57ad861861c69cb63a7b23bcc58ad68e132a94c0939d5/ ... /... pjdfstest_57517a47d0388e0c84fa1915bf11fe4a', expected 0, got EINVAL
not ok 2 - tried 'unlink pjdfstest_57517a47d0388e0c84fa1915bf11fe4a', expected 0, got ENOENT
Failed 2/6 subtests

This test (symlink/03.t) checks the behavior of symlink when a path exceeds PATH_MAX in length.

desc="symlink returns ENAMETOOLONG if an entire length of either path name exceeded {PATH_MAX} characters"

The corresponding code for the failed test case is as follows:

n0=`namegen`
nx=`dirgen_max`
nxx="${nx}x"

mkdir -p "${nx%/*}"
expect 0 symlink ${nx} ${n0}
expect 0 unlink ${n0}

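The test case creates a symlink whose target path is exactly PATH_MAX long (including the terminating 0 byte), which is expected to succeed. On Tencent Cloud CFS, creating this symlink fails with EINVAL, and the subsequent unlink then fails with ENOENT because the link was never created.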
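A similar probe can be written directly against the symlink call. The sketch below builds a target string of a chosen length and reports whether the symlink could be created and read back. The helper names build_long_target and try_symlink are made up for this example, and the target is a simplified stand-in for what pjdfstest's dirgen_max generates. Note that even some local file systems impose their own, lower limits on symlink length, so the result for PATH_MAX - 1 is printed rather than assumed.

```python
import errno
import os
import tempfile

def build_long_target(total_len, component_len=100):
    """Build a relative path of exactly total_len characters from 'x' components."""
    unit = "x" * component_len + "/"
    target = (unit * (total_len // len(unit) + 1))[:total_len]
    if target.endswith("/"):
        target = target[:-1] + "x"   # avoid a trailing slash
    return target

def try_symlink(directory, target_len):
    """Create a symlink with a target of target_len characters.

    Returns "ok" if it was created and reads back unchanged, "mismatch" if the
    stored target differs, or the errno name (e.g. "EINVAL") if creation failed.
    """
    link = os.path.join(directory, "probe_link")
    target = build_long_target(target_len)
    try:
        os.symlink(target, link)     # the target does not need to exist
    except OSError as e:
        return errno.errorcode.get(e.errno, str(e.errno))
    try:
        return "ok" if os.readlink(link) == target else "mismatch"
    finally:
        os.unlink(link)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path_max = os.pathconf(d, "PC_PATH_MAX")
        # PATH_MAX counts the terminating null byte, hence the - 1.
        print(path_max, try_symlink(d, path_max - 1))
```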

Alibaba Cloud NAS

Compared to Tencent Cloud CFS, Alibaba Cloud NAS behaves correctly on symlink but fails several chmod and rename test cases.

chmod failure cases

In this test set, Alibaba Cloud NAS fails the following items:

/root/pjdfstest/tests/chmod/12.t ............
not ok 3 - tried '-u 65534 -g 65534 open pjdfstest_db85e6a66130518db172a8b6ce6d53da O_WRONLY : write 0 x : fstat 0 mode', expected 0777, got 04777
not ok 4 - tried 'stat pjdfstest_db85e6a66130518db172a8b6ce6d53da mode', expected 0777, got 04777
not ok 7 - tried '-u 65534 -g 65534 open pjdfstest_db85e6a66130518db172a8b6ce6d53da O_RDWR : write 0 x : fstat 0 mode', expected 0777, got 02777
not ok 8 - tried 'stat pjdfstest_db85e6a66130518db172a8b6ce6d53da mode', expected 0777, got 02777
not ok 11 - tried '-u 65534 -g 65534 open pjdfstest_db85e6a66130518db172a8b6ce6d53da O_RDWR : write 0 x : fstat 0 mode', expected 0777, got 06777
not ok 12 - tried 'stat pjdfstest_db85e6a66130518db172a8b6ce6d53da mode', expected 0777, got 06777
Failed 6/14 subtests

This test (chmod/12.t) checks the behavior of the SUID/SGID bits.

desc="verify SUID/SGID bit behaviour"

Let's look at test cases 11 and 12 in detail, since they cover both permission bits at once:

# Check whether writing to the file by non-owner clears the SUID+SGID.
expect 0 create ${n0} 06777
expect 0777 -u 65534 -g 65534 open ${n0} O_RDWR : write 0 x : fstat 0 mode
expect 0777 stat ${n0} mode
expect 0 unlink ${n0}

Here, we first create the target file with permissions 06777, then modify the file contents as a non-owner and check whether SUID and SGID are cleared correctly. The 777 in the file permissions will be familiar: it corresponds to rwx for owner, group and other, i.e. readable, writable and executable. The leading 0 marks the number as octal.

The second digit, 6, needs some explanation. This octal digit represents the special permission bits, the two highest of which correspond to setuid/setgid (or SUID/SGID) and can be applied to executable files and public directories. When one of these bits is set, any user who runs the file does so as its owner (or group). This special attribute gives users access to files and directories that are normally available only to the owner. For example, the passwd command has the setuid bit set, which allows a normal user to change their password even though the file where passwords are stored is accessible only to root and cannot be changed directly by the user.

The setuid/setgid design is intended to let users access restricted files (not owned by the current user) in a restricted manner (through a specific executable). Therefore, these permission bits should be cleared automatically when the file is modified by a non-owner, to prevent the user from gaining other permissions through this route.
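The mode arithmetic described above can be checked with the constants in Python's stat module. The helper names special_bits and mode_after_non_owner_write are made up for this example; the second one simply encodes the result that chmod/12.t expects for the mode used in the test.

```python
import stat

SPECIAL_BITS = {
    "setuid": stat.S_ISUID,   # octal 4000
    "setgid": stat.S_ISGID,   # octal 2000
    "sticky": stat.S_ISVTX,   # octal 1000
}

def special_bits(mode):
    """Report which special permission bits are set in mode."""
    return {name: bool(mode & bit) for name, bit in SPECIAL_BITS.items()}

def mode_after_non_owner_write(mode):
    """Mode that chmod/12.t expects after a non-owner writes: SUID and SGID cleared."""
    return mode & ~(stat.S_ISUID | stat.S_ISGID)

print(special_bits(0o6777))
print(oct(mode_after_non_owner_write(0o6777)))   # 0o777
print(stat.filemode(stat.S_IFREG | 0o6777))      # -rwsrwsrwx
```

The values 04777, 02777 and 06777 in the failure log above decode to setuid only, setgid only, and both bits still set.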

The test results show that on Alibaba Cloud NAS, setuid/setgid are not cleared when the file is modified by a non-owner. In effect, a user can run arbitrary content as the owner by modifying the file, which is a security risk.

Read: Special File Permissions (setuid, setgid and Sticky Bit) (System Administration Guide: Security Services)

rename failure cases

Alibaba Cloud NAS has a high number of failures in this test set: 24, all of which occur in rename/09.t.

desc="rename returns EACCES or EPERM if the directory containing 'from' is marked sticky, and neither the containing directory nor 'from' are owned by the effective user ID"

This test checks the behavior of rename when the sticky bit is set: when the directory containing the source object has the sticky bit set, and neither the source object nor the containing directory is owned by the effective user ID, rename should return either EACCES or EPERM. (Such complex logic is reminiscent of the martial arts skill sets in Three Kingdoms Warriors.)

A typical application of the sticky bit is the /tmp directory, which allows everyone to create content but only the owner to delete files. Public upload directories on FTP servers are usually set up in this way.
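The rule that rename/09.t checks can be written as a small predicate. This is a sketch of the sticky-bit rule alone: the function name sticky_rename_allowed is made up, the superuser bypass is a simplification, and the ordinary requirement of write permission on the directory is deliberately left out.

```python
import stat

def sticky_rename_allowed(dir_mode, dir_uid, file_uid, euid):
    """Return True if the sticky-bit rule permits euid to rename the entry.

    Only the sticky rule is modeled; ordinary directory write permission
    checks are out of scope for this sketch.
    """
    if euid == 0:
        return True                    # simplification: superuser is exempt
    if not dir_mode & stat.S_ISVTX:
        return True                    # directory is not sticky, rule does not apply
    # In a sticky directory, only the owner of the entry or of the
    # directory may rename (or remove) the entry.
    return euid in (dir_uid, file_uid)

# Sticky directory and file both owned by 65534, caller is 65533: denied.
print(sticky_rename_allowed(0o1777, 65534, 65534, 65533))   # False
# Caller owns the file: allowed.
print(sticky_rename_allowed(0o1777, 65534, 65533, 65533))   # True
```

A file system that passes rename/09.t returns EACCES or EPERM in exactly the cases where this predicate is False.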

The failed test cases show that Alibaba Cloud NAS support for the sticky bit is incomplete: non-owner rename operations are not rejected and take real effect, i.e. the source file is renamed. This behavior bypasses file system access control and threatens the security of user files.

Amazon EFS

In the pjdfstest run, Amazon Elastic File System (EFS) not only failed a very high proportion of test cases (1533 of 8811), but the failures also covered almost all categories, which is rather surprising.

EFS can be mounted over NFS, but its support for NFS features is incomplete. For example, EFS does not support block devices and character devices, which directly causes a large number of pjdfstest test cases to fail. Even after excluding these two file types, there are still hundreds of failures in various categories, so EFS must be used with care in complex scenarios.

Summary

The comparative analysis above shows that JuiceFS performs best in terms of compatibility, trading sub-second timestamp precision and timestamp range (years 1970 to 2106) for performance, as most network file systems do. Alibaba Cloud NAS and Amazon EFS have the worst compatibility, with a large number of failed compatibility tests, including several with serious security implications, so a security assessment is recommended before use.

JuiceFS has always attached great importance to a high degree of compatibility with the POSIX standard. We have combined compatibility testing tools such as pjdfstest with other random and concurrent testing tools (such as fsracer, fstool, etc.) into an integrated test suite, and we do our best to maintain the maximum degree of compatibility, so that users avoid falling into traps along the way and can focus on developing their own business. Life is short, I use JuiceFS!
