Use JuiceFS on Colab with Google Cloud SQL and GCS
Colaboratory, or "Colab" for short, is a product by Google Research. Colab enables users to write and execute arbitrary Python code through the browser. It is particularly well suited for machine learning, data analysis, and educational purposes.
Colab supports Google Drive for uploading files to or downloading files from Colab instances. However, in some cases, Google Drive might not be that convenient to use with Colab. This is where JuiceFS can be a valuable tool, enabling easy file synchronization between Colab instances, or between a Colab instance and a local or on-premises machine.
A demo Colab notebook using JuiceFS is available here.
This document outlines the necessary steps for using JuiceFS in the Colab environment. We use Google Cloud SQL as the JuiceFS metadata engine and Google Cloud Storage (GCS) as the JuiceFS object storage.
For other types of metadata engines or object storage, see How to Set Up a Metadata Engine and How to Set Up Object Storage.
Many of the steps here are similar to those in the Getting Started document, which you can also use for reference.
Summary of steps
- Format a `juicefs` file system from any machine or instance with access to Google Cloud resources.
- Mount the `juicefs` file system in a Colab notebook.
- Store and share files across machines and platforms.
Prerequisites
This demo uses Google Cloud Platform's Cloud SQL and Google Cloud Storage (GCS) to build a high-performance JuiceFS file system. You need a Google Cloud Platform account to follow this demo document.
If you use another cloud vendor's resources (such as AWS RDS and S3), you can still use this guide as a reference, together with the other reference documents provided by JuiceFS, to achieve a similar solution.
For the best JuiceFS performance, you might also want the Colab instance to run in the same zone as, or in a region close to, the one where Cloud SQL and GCS are deployed. The tutorial also works on a randomly hosted Colab instance, but you might notice slower performance due to the latency between the Colab instance and the Cloud SQL/GCS regions. To start Colab instances in a specific region, see instructions for starting a GCE VM on Colab via GCP Marketplace.
Before diving into the detailed steps, ensure you have the following resources ready:
- A Google Cloud Platform account and a project. This demo uses a GCP project named `juicefs-learning`.
- A Cloud SQL (Postgres) instance ready for use. This demo uses the `juicefs-learning:europe-west1:juicefs-sql-example-1` instance as the metadata service.
- A GCS bucket created as the object storage service. This demo uses `gs://juicefs-bucket-example-1` as the bucket to store file chunks.
- An IAM service account or an authorized user account that has write access to the Postgres server and GCS buckets.
Detailed steps
Step 1: Format a JuiceFS file system
This step needs to be done only once, and you can choose to execute it on any machine or instance where you have good connectivity and access to your Google Cloud resources.
- Use `gcloud auth application-default login` to prepare a local credential, or use `GOOGLE_APPLICATION_CREDENTIALS` to set up a JSON key file.
- Use `cloud_sql_proxy` to open a port (in this case, 5432) locally to expose your cloud Postgres service to your local machine:

      gcloud auth application-default login
      # Or set up the JSON key file via GOOGLE_APPLICATION_CREDENTIALS=/path/to/key
      cloud_sql_proxy -instances=juicefs-learning:europe-west1:juicefs-sql-example-1=tcp:0.0.0.0:5432

- Use the `juicefs format` command to create a new file system named `myvolume`. Later, you can mount this file system on any other machine or instance that has access to your cloud resources. You can download `juicefs` here.

      juicefs format \
      --storage gs \
      --bucket gs://juicefs-bucket-example-1 \
      "postgres://postgres:mushroom1@localhost:5432/juicefs?sslmode=disable" \
      myvolume

Note that this step is only required once on any machine you prefer to work on.
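The metadata URL passed to `juicefs format` is reused unchanged in the mount step, so it can help to assemble it in one place. The following is a minimal Python sketch and not part of the original guide: `build_meta_url` and `format_command` are hypothetical helper names, and percent-encoding the user and password is an assumption that only matters when they contain characters such as `@` or `|`.

```python
from urllib.parse import quote


def build_meta_url(user, password, host="localhost", port=5432,
                   database="juicefs", sslmode="disable"):
    """Assemble a Postgres metadata URL in the shape used in this guide.

    The user and password are percent-encoded so that reserved characters
    cannot break the URL (an assumption; plain passwords pass through
    unchanged).
    """
    user_q = quote(user, safe="")
    password_q = quote(password, safe="")
    return (f"postgres://{user_q}:{password_q}@{host}:{port}/"
            f"{database}?sslmode={sslmode}")


def format_command(meta_url, bucket, name):
    """Return the `juicefs format` argument list shown in Step 1."""
    return ["juicefs", "format", "--storage", "gs",
            "--bucket", bucket, meta_url, name]
```

With the demo values, `build_meta_url("postgres", "mushroom1")` yields the same URL as the command above, and the list returned by `format_command` can be passed to `subprocess.run` if you prefer to drive the CLI from Python.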
Step 2: Mount the JuiceFS file system on Colab
Once you have completed Step 1, you have a JuiceFS file system (named `myvolume` in this case) defined and ready to be used.
Now, open a Colab page and execute the following commands to mount the file system into a folder named `mnt`.
First, download the `juicefs` binary, then get GCP credentials and open the Cloud SQL proxy in the same way as in Step 1.
Note that the following commands run in the Colab environment, so each one starts with a `!` mark to run it as a shell command.
- Download `juicefs` to the Colab runtime instance:

      ! curl -sSL https://d.juicefs.com/install | sh -

- Set up Google Cloud credentials:

      ! gcloud auth application-default login

- Open `cloud_sql_proxy`:

      ! wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy
      ! chmod +x cloud_sql_proxy
      ! GOOGLE_APPLICATION_CREDENTIALS=/content/.config/application_default_credentials.json nohup ./cloud_sql_proxy -instances=juicefs-learning:europe-west1:juicefs-sql-example-1=tcp:0.0.0.0:5432 >> cloud_sql_proxy.log &

- Mount the `myvolume` JuiceFS file system onto the `mnt` folder:

      ! GOOGLE_APPLICATION_CREDENTIALS=/content/.config/application_default_credentials.json nohup juicefs mount "postgres://postgres:mushroom1@localhost:5432/juicefs?sslmode=disable" mnt > juicefs.log &

Now you should be able to use the `mnt` folder as if it were a local file system folder, reading and writing folders and files in it.
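Both the proxy and the mount are started in the background with `nohup ... &`, so their cells return before either service is necessarily ready. As a rough sketch (the helper names `wait_for_port` and `wait_for_mount` are hypothetical, and the timeouts are arbitrary), a Python cell can poll until the proxy accepts connections and until the folder is reported as a mount point:

```python
import os
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait a little and retry.
            time.sleep(interval)
    return False


def wait_for_mount(path, timeout=30.0, interval=0.5):
    """Poll until `path` is reported as a mount point or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.ismount(path):
            return True
        time.sleep(interval)
    return False
```

For example, calling `wait_for_port("localhost", 5432)` after the proxy cell and `wait_for_mount("mnt")` after the mount cell should each return `True` once the service is up. If either returns `False`, `cloud_sql_proxy.log` and `juicefs.log` are the first places to look.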
Step 3: Load data at another time or on another instance
With data stored in the JuiceFS file system in Step 2, you can repeat all the operations in Step 2 at any time and on any other machine to access the previously stored data or to store more data in it.
Congratulations! You have now learned how to use JuiceFS with Google Colab to conveniently share and store data files in a distributed fashion.
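To illustrate the round trip, here is a small Python sketch that writes a file under the mounted folder on one instance and reads it back on another. The helper names `write_marker` and `read_marker` and the file name used below are hypothetical; the only assumption about JuiceFS is that `mnt` is the mounted folder from Step 2.

```python
from pathlib import Path


def write_marker(mount_dir, name, text):
    """Write `text` to `name` under the mounted folder and return its path."""
    path = Path(mount_dir) / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


def read_marker(mount_dir, name):
    """Read back a file previously written under the mounted folder."""
    return (Path(mount_dir) / name).read_text(encoding="utf-8")
```

On the first instance you might run `write_marker("mnt", "notes/hello.txt", "written from Colab")`. After repeating Step 2 on another machine, `read_marker("mnt", "notes/hello.txt")` should return the same text.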
Feel free to explore a demo Colab notebook using JuiceFS here.
Happy coding :)