
Dina Suehiro Jones

Senior Cloud Software Engineer, Artificial Intelligence Products Group

Running distributed machine learning workloads has been a hot topic lately. Intel has shared documents that walk through the process of using Kubeflow* to run distributed TensorFlow* jobs on Kubernetes, a blog on using the Volume Controller for Kubernetes (KVC) for data management on clusters, and a blog describing a real-world use case in which a distributed TensorFlow model predicted the location of tumors in brain scan images. Although running distributed workloads on a cluster sounds attractive to many data scientists, it’s almost never easy.

Introducing Machine Learning Container Templates

Intel is introducing Machine Learning Container Templates (MLT) v0.1.2. MLT is a new open-source command-line tool that streamlines the creation and deployment of machine learning jobs on Kubernetes. MLT bridges the gap between the data scientist and the infrastructure engineer by providing templates that serve as starting points for machine learning jobs, along with simple commands to build Docker images and deploy jobs on the cluster. At the same time, data scientists retain complete flexibility to customize their applications, because MLT templates include the raw ingredients used to build and deploy the job (such as the Dockerfile and the Kubernetes resource file).

We believe that MLT is like the “Keras of Kubernetes”: it provides simple commands that let data scientists get started with distributed model training jobs on Kubernetes without having to be DevOps experts.

The MLT workflow

To create an application, users begin by listing the templates MLT has to offer and selecting the one that most closely resembles their use case. When the app is initialized, MLT creates a directory that includes a Dockerfile, a template for the Kubernetes job manifest, a configuration file, and the model training code. In the example below, we list the templates and then initialize an app based on the distributed TensorFlow MNIST template.


$ mlt templates list
Template               Description
--------------------   --------------------------------------------------------------------------------------------------
hello-world            A TensorFlow python HelloWorld example run through Kubernetes Jobs.
pytorch                Sample distributed application taken from http://pytorch.org/tutorials/intermediate/dist_tuto.html
pytorch-distributed    A distributed PyTorch MNIST example run using the pytorch-operator.
tf-dist-mnist          A distributed TensorFlow MNIST model which designates worker 0 as the chief.
tf-distributed         A distributed TensorFlow matrix multiplication run through the TensorFlow Kubernetes Operator.
 
$ mlt init my-app --template tf-dist-mnist --namespace dina
[master (root-commit) b2ded22] Initial commit.
 8 files changed, 502 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 Dockerfile
 create mode 100644 Makefile
 create mode 100644 README.md
 create mode 100644 crd-requirements.txt
 create mode 100644 k8s-templates/tfjob.yaml
 create mode 100644 main.py
 create mode 100644 requirements.txt

$ cd my-app/
$ ls
Dockerfile            Makefile              README.md             crd-requirements.txt  k8s                   k8s-templates
main.py               mlt.json              requirements.txt

$ mlt config list
Parameter Name                   Value
-------------------------------  ----------------------
gceProject                       cluster-123456
namespace                        dina
name                             my-app
template_parameters.num_ps       1
template_parameters.num_workers  2
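
Template parameters, such as the number of workers, can be adjusted before building. As a sketch (this assumes a config set subcommand; check mlt --help for the exact syntax):

$ mlt config set template_parameters.num_workers 4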

 

MLT’s templates work out of the box, which means that after initializing an app, the user can run it on a Kubernetes cluster with MLT’s build and deploy commands. After deploying the application, we check the status of the job, see that the pods are running on the cluster, and then view the logs. When the job is done, MLT’s undeploy command deletes the job and frees resources on the cluster.


$ mlt build
Starting build my-app:ba4ab530-42f9-4a0e-8036-7e41e5409f12
Building - (Elapsed Time: 0:00:02)                                                                                                  
Built my-app:ba4ab530-42f9-4a0e-8036-7e41e5409f12

$ mlt deploy
Pushing  \ (Elapsed Time: 0:00:21)                                                                                                  
Pushed to gcr.io/cluster-123456/my-app:ba4ab530-42f9-4a0e-8036-7e41e5409f12
Deploying gcr.io/cluster-123456/my-app:ba4ab530-42f9-4a0e-8036-7e41e5409f12

Inspect created objects by running:
$ kubectl get --namespace=dina all

$ mlt status
TF Job:
NAME                                          AGE
my-app-62d0687d-a375-43ea-b4be-391a33750be9   5s

Pods:
NAME                                                           READY     STATUS    RESTARTS   AGE
my-app-62d0687d-a375-43ea-b4be-391a33750-ps-1cjq-0-lu16r       1/1       Running   0          6s 
my-app-62d0687d-a375-43ea-b4be-391a33750-worker-1cjq-0-ywxoy   1/1       Running   0          6s
my-app-62d0687d-a375-43ea-b4be-391a33750-worker-1cjq-1-0cbd1   1/1       Running   0          6s

$ mlt logs 
Checking for pod(s) readiness
Will tail 3 logs...
my-app-62d0687d-a375-43ea-b4be-391a33750-ps-1cjq-0-lu16r
my-app-62d0687d-a375-43ea-b4be-391a33750-worker-1cjq-0-ywxoy
my-app-62d0687d-a375-43ea-b4be-391a33750-worker-1cjq-1-0cbd1
...

$ mlt undeploy

Next, we can develop iteratively by making model updates, then rebuilding and redeploying the app with MLT. MLT also has a --watch option for automatic rebuilds when file changes are detected, and an --interactive option that gives the user a shell into the container. The Dockerfiles in the MLT templates are designed so that subsequent builds that only update source files are faster than the initial build.
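For example (assuming --watch and --interactive are flags on the build and deploy commands; check mlt build --help and mlt deploy --help for the exact usage):

$ mlt build --watch
$ mlt deploy --interactive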

Locating brain tumors using MLT

Now that we have walked through the basics of MLT, let’s look at how to adapt this app to a real-world use case. For this example, we will use a U-Net model that predicts the location of tumors in brain scan images using the BraTS dataset. Details on this model are described in a previous blog (see the related links below).

To download the dataset onto the nodes of the Kubernetes cluster, we used the Volume Controller for Kubernetes (KVC). (We won’t walk through the whole KVC process here; a previous blog already discusses it.)

After KVC finished downloading the dataset to the nodes, we got the node list and host path from the volumemanager custom resource, and used this information to add a node affinity and volume mount to the k8s-templates/tfjob.yaml file. In the same file, we also added environment variables with our cloud storage credentials and bucket information, since the model writes checkpoint files to cloud storage.
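
As an illustration, our additions to k8s-templates/tfjob.yaml looked roughly like the fragment below, merged into the pod template of the TFJob manifest. This is only a sketch: the node names, host path, mount path, and environment variable names are hypothetical placeholders.

# Hypothetical fragment for the pod template in k8s-templates/tfjob.yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:                # node list from the volumemanager custom resource
          - node-1
          - node-2
containers:
- name: tensorflow
  env:
  - name: STORAGE_BUCKET         # hypothetical bucket for checkpoint files
    value: my-checkpoint-bucket
  - name: STORAGE_ACCESS_KEY     # hypothetical credential pulled from a secret
    valueFrom:
      secretKeyRef:
        name: storage-creds
        key: access-key
  volumeMounts:
  - name: dataset
    mountPath: /data             # where the training script reads the BraTS data
volumes:
- name: dataset
  hostPath:
    path: /var/datasets/kvc-resource-xyz   # host path from the volumemanager resource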

Next, we replaced the original MNIST main.py file with the U-Net model training Python files (from the distributed U-Net GitHub repo linked below). We updated the Dockerfile to execute test_dist.py, the main model training script. Because we are using the TFJob operator, the model training script had to be modified to get the cluster information (the list of workers and parameter servers, the job name, and the task index) from the TF_CONFIG environment variable instead of from flags. Lastly, we updated the requirements.txt file to include the libraries required to run this particular model.
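
For reference, reading the cluster configuration from TF_CONFIG looks roughly like the sketch below. The TFJob operator sets TF_CONFIG to a JSON document on each replica; this is a minimal illustration, not the exact code in test_dist.py.

# Minimal sketch of parsing TF_CONFIG (TensorFlow 1.x era APIs)
import json
import os

import tensorflow as tf

# Example TF_CONFIG set by the TFJob operator on each replica:
# {"cluster": {"ps": ["ps0:2222"], "worker": ["w0:2222", "w1:2222"]},
#  "task": {"type": "worker", "index": 0}}
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster_spec = tf_config.get("cluster", {})
task = tf_config.get("task", {})
job_name = task.get("type", "worker")    # "ps" or "worker"
task_index = task.get("index", 0)

cluster = tf.train.ClusterSpec(cluster_spec)
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                        # parameter servers only serve variables
else:
    is_chief = task_index == 0           # worker 0 is designated as the chief
    # ... build the model and train against server.target ...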

After making those changes, we are ready to rebuild and deploy the model using MLT:


$ mlt build
Starting build distributed-unet:9d271e57-e27c-47d8-b1a8-34a9d824c03a
Building \ (Elapsed Time: 0:02:26)                                                                                                                                                                      
Built distributed-unet:9d271e57-e27c-47d8-b1a8-34a9d824c03a

$ mlt deploy
Pushing  | (Elapsed Time: 0:01:43)                                                                                                                                                                      
Pushed to gcr.io/cluster-123456/distributed-unet:9d271e57-e27c-47d8-b1a8-34a9d824c03a
Deploying gcr.io/cluster-123456/distributed-unet:9d271e57-e27c-47d8-b1a8-34a9d824c03a

Inspect created objects by running:
$ kubectl get --namespace=dina all

$ mlt status
TF Job:
NAME                                                    AGE
distributed-unet-e925bc66-3766-4bf2-bbdd-cd20eb0cb033   3m

Pods:
NAME                                                           READY     STATUS    RESTARTS   AGE
distributed-unet-e925bc66-3766-4bf2-bbdd-ps-fkoj-0-4kupn       1/1       Running   0          3m
distributed-unet-e925bc66-3766-4bf2-bbdd-worker-fkoj-0-yc96p   1/1       Running   0          3m
distributed-unet-e925bc66-3766-4bf2-bbdd-worker-fkoj-1-divnx   1/1       Running   0          3m
distributed-unet-e925bc66-3766-4bf2-bbdd-worker-fkoj-2-zac9u   1/1       Running   0          3m

For our model, checkpoint files are saved to a cloud storage location, so we can point TensorBoard at that location and watch the progress of the model training. The screenshot below was taken after letting the model train for a while. The first row of images is the ground truth, the second row has the brain scan image, and the last row is our prediction. As you can see, the predictions are getting pretty close!
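For example, if the checkpoints were written to a (hypothetical) Google Cloud Storage bucket, TensorBoard could be launched with:

$ tensorboard --logdir=gs://<your-bucket>/checkpoints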

We have demonstrated a simple example that runs an MLT template out of the box, as well as a more complex use case that uses volume mounts from KVC and modifies the MLT template to run the U-Net model on a dataset of brain scan images. We found that starting with a simple working example eased the move to the more complex use case. And of course, it didn’t work on the first try, so MLT’s build and deploy features helped simplify the iterative development process.

What’s next?

We are constantly adding new features to MLT to further streamline the process of deploying machine learning workloads on Kubernetes. Upcoming features in our pipeline include a Horovod* template, code-syncing commands to reduce iteration time, and a hyperparameter experiments template.

MLT can be installed with a simple pip command. Visit us on GitHub for additional information and instructions on getting started:

https://github.com/IntelAI/mlt
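
As a sketch, the install is a single command (this assumes the package is published on PyPI under the name mlt; the GitHub README has the authoritative instructions):

$ pip install mlt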

After you’ve used MLT, please share with us any feature requests or ideas you may have to improve the process of running machine learning jobs using Kubernetes.

Related links

Intel AI Blog: Biomedical Image Segmentation with U-Net:

https://ai.intel.com/biomedical-image-segmentation-u-net/

MLT example using the distributed U-Net model:

https://github.com/IntelAI/mlt/tree/master/examples/distributed_unet

Distributed U-Net model:

https://github.com/NervanaSystems/topologies/tree/master/distributed_unet

Information on using the BraTS datasets:

https://github.com/NervanaSystems/topologies/tree/master/distributed_unet#required-data

https://www.smir.ch/BRATS/Start2016

Volume Controller for Kubernetes (KVC):

https://github.com/intelai/experimental-kvc

The Kubeflow TF-Operator:

https://github.com/kubeflow/tf-operator

Notices and Disclaimers

Whenever using and/or referring to the BraTS datasets in your publications, please make sure to cite the following papers:

  1. https://www.ncbi.nlm.nih.gov/pubmed/25494501
  2. https://www.ncbi.nlm.nih.gov/pubmed/28872634

Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

© Intel Corporation

*Other names and brands may be claimed as the property of others.
