Horovod Distributed Training on Kubernetes using MLT

Uber’s* Horovod is a great way to train distributed deep learning models. Its ring-allreduce network architecture scales well to several hundred nodes and it only requires a few simple changes in your Keras* or TensorFlow* code to get going.
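To get a feel for why ring-allreduce scales so well, here is a tiny, purely illustrative simulation in plain Python. The worker values are made up; real Horovod performs these exchanges over MPI, and each worker only ever talks to its two ring neighbors:

```python
# Illustrative ring-allreduce: N workers each hold a gradient vector split
# into N segments. In 2*(N-1) neighbor-to-neighbor steps, every worker ends
# up with the full sum -- per-worker traffic stays constant as N grows.

def ring_allreduce(grads):
    n = len(grads)
    vals = [list(g) for g in grads]  # each worker's local copy

    # Phase 1: reduce-scatter. After n-1 steps, worker i holds the
    # complete sum for segment (i + 1) % n.
    for step in range(n - 1):
        before = [v[:] for v in vals]   # all sends in a step are concurrent
        for i in range(n):
            seg = (i - step) % n        # segment worker i forwards
            vals[(i + 1) % n][seg] += before[i][seg]

    # Phase 2: allgather. Each worker circulates its finished segment
    # around the ring so everyone ends with identical sums.
    for step in range(n - 1):
        before = [v[:] for v in vals]
        for i in range(n):
            seg = (i + 1 - step) % n
            vals[(i + 1) % n][seg] = before[i][seg]
    return vals

# Three workers, three segments each; every worker ends with [12, 15, 18].
print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Each worker sends and receives one segment per step, so bandwidth per worker is independent of the number of workers, which is what lets the pattern scale to hundreds of nodes.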

The only bottleneck we’ve found when running jobs with Horovod is the initial environment setup. As with many deep learning workloads, it takes a bit of effort to get TensorFlow, Horovod, MPI, and the hardware interfaces to play well together. This is where Docker* and Kubernetes really shine.

At Intel, we have created an open-source project called Machine Learning Container Templates (MLT), which packages deep learning workloads into easy-to-deploy Docker containers for quick scaling across multiple devices via Kubernetes. We’ve handled all of the complex setup in advance to provide a great out-of-the-box experience when developers are ready to deploy their custom deep learning models for distributed training.

In a past blog, we showed how to train a brain tumor detection deep neural network using TensorFlow's parameter server distributed platform. We’ll take the same U-Net topology and BraTS data from that blog and show how to package the code into a Docker container with Horovod for quick deployment on Kubernetes.

Using MLT Horovod Template

This blog assumes you already have a Kubernetes cluster up and running and MLT installed.

Horovod Template

The MLT Horovod template leverages the OpenMPI component from Kubeflow to deploy distributed jobs for model training. We’ve incorporated Intel® Math Kernel Library (Intel® MKL) optimized TensorFlow in the CPU Dockerfile to significantly improve training speed compared to vanilla TensorFlow when using Intel® processors.

Data Transfer to Kubernetes cluster

You will need to register to get the BraTS dataset. After downloading the data, you can convert the raw NIfTI MRI images to .npy format with our readily available script.

In this blog, we’ve used Volume Controller for Kubernetes (VCK) to copy the data (.npy files) from Google* Cloud Storage to the Kubernetes nodes. Please refer to this section to use VCK. Take note of the node affinity (which nodes the data is copied to) and the host path (the path on the node); you will need these in a moment.

Let’s jump into adapting the Horovod template to locate tumors in brain scan images.

Initialize the Horovod template using mlt init. During init, you specify which Docker registry to use; MLT supports Docker Hub and Google Container Registry (e.g., gcr.io/<your_project_id> or docker.io/repo_name).

The unet-horovod directory now contains Dockerfiles, the training file, and a custom deployment script to deploy the job onto the Kubernetes cluster.

A complete unet example based on the above template is located here. The files in the example are described below:

  • Dockerfile.cpu – CPU Dockerfile with the Intel® Optimization for TensorFlow
  • mlt.json – The main configuration spec where you define parameters for your job. You can provide extra parameters via the template_parameters section, which are exposed as environment variables to the deployment script deploy.sh.

  • main.py – The actual model script
  • requirements.txt – Extra Python* pip dependencies required for your scripts
  • deploy.sh – Custom deployment script; it is the user's responsibility to pass the required configuration to the training scripts through the mpirun command
    • Node Selector: If you want to launch jobs on a specific pool of nodes, set this to a node label. In our case, we targeted the nodes listed in node-affinity (because those are high-memory nodes). Alternatively, you can ignore this flag entirely; just make sure you remove the corresponding flag usage in the file.
    • Volume Support: Use the hostPath noted during the VCK data copy here; it will be mounted as a volume in the containers. Results written to this path are persisted.
      • Update the same path for DATA_PATH and OUTPUT_PATH in settings.py
      • NOTE: Please make sure you use ksonnet version 0.11.0 for the volume mount to work as expected.
  • data.py – The data loader. You can modify it if you have a different dataset to train.
  • model.py – The model implementation. This is where the U-Net model topology is coded. You can change it to your own topology.
  • convert.py – Converts the raw BraTS images to .npy files.
  • preprocessor.py – Updates the image channels.
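For reference, an mlt.json for this example might look roughly like the following. The field names and values here are illustrative placeholders; consult the file generated by mlt init for the exact schema:

```json
{
  "name": "unet-horovod",
  "namespace": "default",
  "registry": "gcr.io/<your_project_id>",
  "template_parameters": {
    "num_nodes": "4",
    "data_path": "/var/datasets/brats",
    "output_path": "/var/datasets/output"
  }
}
```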

Once you have made the required modifications, use the mlt build command to build and tag the Docker images. The initial build usually takes longer to complete because we have to build the Intel® Optimization for TensorFlow on the OpenMPI base image; successive builds, however, complete within a few seconds. You can speed up the initial build by hosting the combined base image (OpenMPI and the Intel® Optimization for TensorFlow) on Docker Hub. You can refer to this Dockerfile.
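The idea is simply to bake the slow layers into a prebuilt image, along these lines (the base image name below is a placeholder, not the actual Dockerfile from the repository):

```dockerfile
# Hypothetical prebuilt base: an OpenMPI image plus the Intel Optimization
# for TensorFlow, so mlt build only has to layer the model code on top.
FROM <your-openmpi-base-image>

RUN pip install intel-tensorflow horovod
```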

To see what’s happening in the background, use the verbose flag: mlt build -v

Now it’s time to deploy with mlt deploy. This command first pushes the images to the provided registry and then uses those images to deploy onto Kubernetes.

To check the status of the job, use mlt status

Use mlt logs to see training progress and mlt events for job events in Kubernetes.

If you only modify mlt.json (e.g., num_nodes: “4”, gpu: “2”), there is no need to push images to the registry again, so you can just do the following:

  • Undeploy the current job using mlt undeploy, which frees up resources
  • Deploy with mlt deploy --no-push (the --no-push flag skips pushing the image and reuses the existing one, since there are no code changes)

If there are code changes, rebuild the images with the mlt build command and then deploy the new job with mlt deploy.

Model checkpoints are stored at the OUTPUT_PATH (an NFS-mounted location) specified in mlt.json and settings.py. You can use TensorBoard to view the training results.

tensorboard --logdir <OUTPUT_PATH>

Cluster Specifications

We used a 4-node Kubernetes cluster; each node's specification is as follows:

OS Version: CentOS Linux release 7.5.1804 (Core)
Kernel: 3.10.0-862.3.2.el7.x86_64 #1 SMP Mon May 21 23:36:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
CPU: 2 X Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
Memory: 256GB DDR4 1866 MHz
Disk: 400GB SSD on Sata0
Software:  TensorFlow 1.9, MKL-DNN 0.14
BIOS Information:
Vendor: Intel Corporation
Version: SE5C610.86B.01.01.0027.071020182329
Release Date: 07/10/2018
Microcode Update Driver: v2.01 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
Microcode Patch: microcode_ctl-2.1-29.10.el7_5.x86_64


The MLT template includes methods to output the final model, training checkpoints, and TensorBoard logs to either a local folder or a cloud storage location. By taking advantage of both Kubernetes and TensorFlow’s MonitoredTrainingSession, this deployment has automatic failover built in: if one or more nodes fail during training, Kubernetes will attempt to restart them and the TensorFlow session will resume training from the last good checkpoint! This is a must-have feature for jobs that may run for hours or days at a time. You can monitor your job as it trains (or anytime afterward) in TensorBoard. As you can see from the TensorBoard figures, our model does a pretty good job of finding brain tumors!
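Schematically, the resume-from-checkpoint behavior looks like the following plain-Python sketch (the file layout and step counts are invented for illustration; MonitoredTrainingSession manages the real checkpoint files for you):

```python
import json
import os
import tempfile

def train(checkpoint_path, total_steps):
    """Run (or resume) a toy training loop, checkpointing every 10 steps."""
    start = 0
    if os.path.exists(checkpoint_path):      # resume after a restart
        with open(checkpoint_path) as f:
            start = json.load(f)["step"]
    step = start
    while step < total_steps:
        step += 1                            # stand-in for one training step
        if step % 10 == 0:                   # periodic checkpoint
            with open(checkpoint_path, "w") as f:
                json.dump({"step": step}, f)
    return start                             # where this run picked up

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
first = train(ckpt, 25)    # initial run; last checkpoint lands at step 20
second = train(ckpt, 40)   # a "restarted" pod resumes at step 20, not 0
print(first, second)       # -> 0 20
```

In the real deployment, Kubernetes restarting the pod plays the role of the second call: the new process finds the checkpoint on the NFS-mounted OUTPUT_PATH and continues from there.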


MLT’s built-in support for Horovod and TensorFlow gives you many advantages:

  • Rapidly scale, deploy, and undeploy distributed training workloads for deep learning.
  • Monitor training progress via TensorBoard logs.
  • Automatic failover support if nodes break.
  • Better utilization of hardware resources across your data science team.


We thank Abolfazl Shahbazi and Dina Suehiro for their help and feedback.


  1. Sergeev A and M Del Balso. “Horovod: fast and easy distributed deep learning in TensorFlow.” (21 Feb 2018) https://arxiv.org/pdf/1802.05799.pdf
  2. Bakas S, Akbari H, Sotiras A, Bilello M, Rozycki M, Kirby JS, Freymann JB, Farahani K, Davatzikos C. “Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features”, Nature Scientific Data, 4:170117 (2017) DOI: 10.1038/sdata.2017.117
  3. Bakas S, Akbari H, Sotiras A, Bilello M, Rozycki M, Kirby J, Freymann J, Farahani K, Davatzikos C. “Segmentation Labels and Radiomic Features for the Pre-operative Scans of the TCGA-GBM collection”, The Cancer Imaging Archive, 2017. DOI: 10.7937/K9/TCIA.2017.KLXWJJ1Q
  4. Bakas S, Akbari H, Sotiras A, Bilello M, Rozycki M, Kirby J, Freymann J, Farahani K, Davatzikos C. “Segmentation Labels and Radiomic Features for the Pre-operative Scans of the TCGA-LGG collection”, The Cancer Imaging Archive, 2017. DOI: 10.7937/K9/TCIA.2017.GJQ7R0EF
  5. Menze BH, Jakab A, Bauer S, Kalpathy-Cramer J, Farahani K, Kirby J, Burren Y, Porz N, Slotboom J, Wiest R, Lanczi L, Gerstner E, Weber MA, Arbel T, Avants BB, Ayache N, Buendia P, Collins DL, Cordier N, Corso JJ, Criminisi A, Das T, Delingette H, Demiralp C, Durst CR, Dojat M, Doyle S, Festa J, Forbes F, Geremia E, Glocker B, Golland P, Guo X, Hamamci A, Iftekharuddin KM, Jena R, John NM, Konukoglu E, Lashkari D, Mariz JA, Meier R, Pereira S, Precup D, Price SJ, Raviv TR, Reza SM, Ryan M, Sarikaya D, Schwartz L, Shin HC, Shotton J, Silva CA, Sousa N, Subbanna NK, Szekely G, Taylor TJ, Thomas OM, Tustison NJ, Unal G, Vasseur F, Wintermark M, Ye DH, Zhao L, Zhao B, Zikic D, Prastawa M, Reyes M, Van Leemput K. “The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS)”, IEEE Transactions on Medical Imaging 34(10), 1993-2024 (2015) DOI: 10.1109/TMI.2014.2377694
  6. Ronneberger O, Fischer P, and Brox T. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” arXiv:1505.04597v1 [cs.CV] 18 May 2015


  1. https://www.ncbi.nlm.nih.gov/pubmed/25494501
  2. https://www.ncbi.nlm.nih.gov/pubmed/28872634
  3. https://github.com/IntelAI/mlt
  4. https://github.com/IntelAI/vck
  5. https://github.com/kubeflow/kubeflow
  6. https://eng.uber.com/horovod/

Refer to https://github.com/01org/mkl-dnn for more details on Intel® MKL-DNN optimized primitives

Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Performance results are based on testing as of 8/17/2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Intel, Xeon, and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© Intel Corporation.