Our Friend the Object Store

This is your TensorFlow I/O operation:

This is your TensorFlow I/O operation on S3:

Any questions?

…Oh wait, you have a ton of questions and doing this awesome thing interests you greatly?

What’s going on here?

Back in July 2017, Yong Tang added an S3 backend for TensorFlow’s FileSystem interface. This means that almost anywhere TensorFlow I/O operations accept a file path, an s3:// URL can be used instead.

To use this feature we’ll need to follow so few steps that we can enumerate them in a big bold font to make it look super simple.

Step 1: Define your S3 parameters

The S3 backend is configured through environment variables. Start with the values below and adjust them for your S3 environment:
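The configuration snippet that belonged here is missing from this copy, so what follows is a minimal sketch and not the author’s exact values. It sets, from Python, the environment variables that TensorFlow’s S3 filesystem reads. The credentials, region, and endpoint are placeholders, and `configure_s3` is a hypothetical helper name.

```python
import os

# Placeholder values (assumptions): replace with your own credentials,
# region, and endpoint before running against a real bucket.
S3_SETTINGS = {
    "AWS_ACCESS_KEY_ID": "YOUR_ACCESS_KEY",
    "AWS_SECRET_ACCESS_KEY": "YOUR_SECRET_KEY",
    "AWS_REGION": "us-west-1",
    "S3_ENDPOINT": "s3.us-west-1.amazonaws.com",
    "S3_USE_HTTPS": "1",   # talk to the endpoint over HTTPS
    "S3_VERIFY_SSL": "1",  # verify the endpoint's TLS certificate
}


def configure_s3(settings=S3_SETTINGS):
    """Export S3 settings into the process environment.

    Values already present in the environment are left untouched, so
    anything exported by the shell takes precedence over these defaults.
    Returns the effective value of every setting.
    """
    for name, value in settings.items():
        os.environ.setdefault(name, value)
    return {name: os.environ[name] for name in settings}
```

Calling `configure_s3()` before importing TensorFlow is the cautious choice, since the variables need to be in place before the first S3 path is opened. Exporting the same names from your shell works just as well.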

Step 2: Use TensorFlow

Next, use TensorFlow.

Take your favorite model and try it out! Simply swap any paths in your model for S3 URLs. For the linked model, this is controlled by an environment variable:
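The name of that variable is not preserved in this copy, so here is a sketch of the general pattern with a hypothetical `DATA_DIR` variable: the training script reads its data location from the environment and falls back to a local directory. The helper names and the default path are assumptions for illustration.

```python
import os


def resolve_data_dir(env_var="DATA_DIR", default="/tmp/data"):
    """Return the data location: the environment variable if set, else a local default."""
    return os.environ.get(env_var, default)


def is_s3_path(path):
    """True when the path uses the s3:// scheme handled by the S3 backend."""
    return path.startswith("s3://")


if __name__ == "__main__":
    data_dir = resolve_data_dir()
    where = "S3" if is_s3_path(data_dir) else "the local filesystem"
    print("Reading training data from {} ({})".format(data_dir, where))
```

With this in place, running the script as `DATA_DIR=s3://my-bucket/mnist python train.py` (bucket and script names are made up) switches it to S3 without touching the model code.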

Or you can just do the smoke test we started this post with:
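The original snippet is not preserved here, so this is a minimal sketch of such a smoke test, not the author’s code. It writes a small file and reads it back through `tf.io.gfile.GFile`, which goes through the same FileSystem interface for local and `s3://` paths. The `opener` parameter is an addition for illustration, so the function can be tried without TensorFlow installed, and the bucket name in the usage note is hypothetical.

```python
def smoke_test(path, message="hello, object store", opener=None):
    """Write `message` to `path`, read it back, and return what was read.

    `opener` defaults to TensorFlow's GFile, which dispatches on the path
    scheme (a plain path is local, s3:// goes to the S3 backend).
    """
    if opener is None:
        # Imported lazily so the function can be exercised with another
        # opener (for example the built-in `open`) without TensorFlow.
        import tensorflow as tf
        opener = tf.io.gfile.GFile

    with opener(path, "w") as f:
        f.write(message)
    with opener(path, "r") as f:
        return f.read()
```

Try `smoke_test("/tmp/smoke.txt")` first, then `smoke_test("s3://my-bucket/smoke.txt")`. If the second call returns the same string, your S3 settings are working. In newer TensorFlow releases the S3 filesystem is, as far as I know, provided by the separate `tensorflow-io` package, so an extra `import tensorflow_io` may be needed there.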

Almost every utility in the TensorFlow ecosystem will also respect an S3 path:
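The example that followed is missing from this copy. As one illustration, assuming TensorBoard is the utility in question, this sketch builds the command line for pointing it at a log directory on S3. The bucket name is hypothetical, and the S3 environment variables from Step 1 are assumed to be set in the environment that runs the command.

```python
import shlex


def tensorboard_command(logdir, port=6006):
    """Build the argument list for launching TensorBoard against `logdir`."""
    return ["tensorboard", "--logdir", logdir, "--port", str(port)]


# Print a shell-ready command; pass the list to subprocess.run to launch it.
print(shlex.join(tensorboard_command("s3://my-bucket/logs")))
```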

But Why?

When it comes to storing data, there are many options, each with benefits and drawbacks. The most common is the local filesystem. However, a local disk does not scale beyond a single machine, which makes it a non-starter for distributed training. Shared filesystems are also available, but implementations tend to be rare on cloud service providers, and an error on the server side can mean a hung training job, problematic mounts, or a node reboot. While an object store isn’t a panacea for your I/O woes (it may even perform worse), it offers resilience, simplicity, and ubiquity that shared filesystems can’t match.

What if I don’t have AWS?

While Amazon invented S3, the simple semantics of the interface have made it the de facto object store API. Google Cloud Storage is interoperable, there are guides on proxying requests for Azure Blob Storage, and countless others provide S3-compatible storage solutions.

However, one solution that stood out to me, especially for ease of use, was Minio. Minio is a distributed S3-compatible object store written in Go. It is SUPER simple to deploy and has an amazingly responsive team.

I do most of my work on Kubernetes, and I was easily able to tailor their examples to deploy on a bare-metal cluster with no StorageClass set up.

This means that even in your datacenter, with no existing storage solution, you can get up and running with S3 in no time!
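Pointing TensorFlow at such a deployment is mostly a matter of changing the endpoint. The sketch below assumes a Minio service reachable over plain HTTP on Minio’s default port 9000. The hostname and credentials are made-up placeholders, and `minio_settings` is a hypothetical helper.

```python
import os


def minio_settings(host, port=9000, access_key="minio",
                   secret_key="minio123", use_https=False):
    """Return the S3 environment variables for a self-hosted endpoint.

    The default credentials are placeholders (assumptions); use the keys
    your Minio deployment was started with.
    """
    flag = "1" if use_https else "0"
    return {
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
        "S3_ENDPOINT": "{}:{}".format(host, port),
        "S3_USE_HTTPS": flag,
        # Certificate verification only matters when HTTPS is in use.
        "S3_VERIFY_SSL": flag,
    }


# Hypothetical in-cluster service name for a Minio deployment on Kubernetes.
os.environ.update(minio_settings("minio-service.default.svc.cluster.local"))
```

After this, an `s3://` path in TensorFlow resolves against the Minio endpoint instead of AWS.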

Now what?

Try it out with your model, or check out the MNIST example in the Kubeflow project for an end-to-end S3-based workflow.

Also, if you need more performance for data loading, check out Balaji Subramaniam’s Kube Volume Controller, which caches S3 data local to your workloads.

While S3 support in TensorFlow is relatively young, S3 has seen success throughout the IT industry, and this intersection of object storage and machine learning will allow us to look at our storage solutions in a new context.