Using Intel® Xeon® for Multi-node Scaling of TensorFlow* with Horovod*

TensorFlow* is one of the leading deep learning and machine learning frameworks today. Earlier in 2017, Intel worked with Google* to incorporate optimizations for Intel® Xeon® processor-based platforms using Intel® Math Kernel Library (Intel® MKL) [1].  Optimizations such as these with multiple popular frameworks have led to orders of magnitude improvement in performance.

So far, Intel has mainly reported Intel-optimized TensorFlow performance improvements on a single node. However, many complex deep learning models must be trained across multiple nodes: they either do not fit in one machine, or their time-to-train can be significantly reduced by training on a cluster of machines. Intel has therefore also performed scaling studies on multi-node clusters of Intel Xeon Scalable processors. This blog shows TensorFlow distributed training performance on a cluster of Intel® Xeon® platforms (system configuration details can be found at the end of this post) using Horovod, a distributed training framework for TensorFlow.

Horovod*, which was developed by Uber*, uses the Message Passing Interface (MPI) as its main communication mechanism. It relies on MPI concepts such as allgather and allreduce to handle cross-replica communication and weight updates. OpenMPI* can be used with Horovod to support these operations. Horovod is installed as a separate Python package. By calling Horovod's API from the deep learning model script, a regular build of TensorFlow can be used to run distributed training; no TensorFlow source code changes are required to support distributed training with MPI. A minimal sketch of this integration pattern is shown below.
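The following is a minimal sketch of the typical Horovod integration pattern for TensorFlow 1.x model scripts. It is illustrative only: build_model() is a hypothetical placeholder for the user's model-building code, and the optimizer, learning rate, and step count are chosen purely for the example.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                       # initialize Horovod (MPI)

loss = build_model()                             # hypothetical: returns the training loss tensor
opt = tf.train.GradientDescentOptimizer(0.01)
opt = hvd.DistributedOptimizer(opt)              # allreduce gradients across all ranks
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

# Broadcast initial weights from rank 0 so all workers start from identical state.
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=1000)]

with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)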

Scaling Results using Uber Horovod with TensorFlow 1.7

In this section, we show performance numbers for Intel-optimized TensorFlow 1.7 training ResNet-50 and Inception-V3 on up to 64 nodes with Intel Xeon Gold processors. A real training dataset was used for these runs. As shown in the charts below, with one MPI process per node, ResNet-50 maintained at least 89.1% scaling efficiency up to 64 nodes, while Inception-V3 achieved at least 89.4%. With this higher throughput for ResNet-50 and Inception-V3, time-to-train is reduced significantly.

Although this study shows scaling for up to 64 nodes, we expect the same scaling efficiency to carry over to 128 nodes.

The user can also run the same models with two MPI processes per node. As shown in the charts below, this yields up to 17% and 24% additional throughput for ResNet-50 and Inception-V3, respectively, at no extra hardware cost. Please note that the batch size per node remains the same as for one MPI process per node.

Thus, by running two MPI processes per node, as shown in the two graphs below, ResNet-50 maintained at least 94.1% scaling efficiency up to 64 nodes, while Inception-V3 achieved at least 87.4%. The resulting higher throughput for ResNet-50 and Inception-V3 reduces time-to-train significantly, even more than with one MPI process per node.
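For reference, scaling efficiency here is the measured cluster throughput divided by the single-node throughput multiplied by the number of nodes. A small illustrative calculation follows; the throughput figures in it are hypothetical placeholders, not measured results.

def scaling_efficiency(single_node_imgs_per_sec, n_nodes, cluster_imgs_per_sec):
    # Efficiency = measured cluster throughput / ideal linear-scaling throughput.
    ideal = single_node_imgs_per_sec * n_nodes
    return cluster_imgs_per_sec / ideal

# Example: if one node sustains 100 images/sec and 64 nodes sustain 5,700 images/sec,
# efficiency is 5700 / (64 * 100) ~= 0.89, i.e. roughly 89% scaling (a ~57x speedup).
print(scaling_efficiency(100.0, 64, 5700.0))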


Gathering and Installing Relevant Software Tools

  1. OpenMPI can be installed via yum on recent versions of CentOS, and some existing clusters already have it available. In this blog we use OpenMPI 3.0.0; installation instructions are available at https://www.open-mpi.org/software/ompi/v3.0/
  2. A recent GCC version is needed; GCC 6.2 or newer is recommended. The latest release can be obtained from https://gcc.gnu.org/
  3. Both Python 2.7 and 3.6 have been tested.
  4. Uber Horovod supports running TensorFlow in a distributed fashion. Horovod can be installed as a standalone Python package as follows (a quick sanity check is shown after this list):
    pip install --no-cache-dir horovod (e.g. horovod-0.11.3)
     To install Horovod from source, see https://github.com/uber/horovod
  5. The TensorFlow benchmarks have recently been modified to use Horovod. You can obtain the benchmark code from GitHub:
     git clone https://github.com/tensorflow/benchmarks
     cd benchmarks/scripts/tf_cnn_benchmarks
     Run tf_cnn_benchmarks.py as explained below.
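As an optional sanity check that Horovod and MPI are working together, the following few lines can be saved to a file (the name check_horovod.py below is arbitrary) and launched with, for example, mpirun -n 2 python check_horovod.py; each rank should print its own rank number.

# check_horovod.py: verify that Horovod initializes and sees all MPI ranks.
import horovod.tensorflow as hvd

hvd.init()
print("rank %d of %d (local rank %d)" % (hvd.rank(), hvd.size(), hvd.local_rank()))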

Running the TensorFlow Benchmarks Using Horovod

Here, we discuss the run commands needed to run distributed TensorFlow using the Horovod framework. For the hardware platform, we use a cluster of dual-socket Intel® Xeon® Gold 6148 processor-based systems networked with 10Gb Ethernet. Mellanox InfiniBand* or Intel® Omni-Path Architecture (Intel® OPA) can also be used to network the cluster.

Running two MPI processes on a single node:

export LD_LIBRARY_PATH=<path to OpenMPI lib>:$LD_LIBRARY_PATH
export PATH=<path to OpenMPI bin>:$PATH
export inter_op=2
export intra_op=18 {# approximately the number of physical cores per socket}
export batch_size=64
export MODEL=resnet50 {or inception3}
export python_script={path to tf_cnn_benchmarks.py script}
# OMP_NUM_THREADS is forwarded by mpirun below; if it is not already set in your
# environment, export it as well (for example, to the number of physical cores
# available to each MPI process).

mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -cpus-per-proc 20 --map-by socket --oversubscribe \
--report-bindings -n 2 python $python_script --mkl=True --forward_only=False --num_batches=200 \
--kmp_blocktime=0 --num_warmup_batches=50 --num_inter_threads=$inter_op --distortions=False \
--optimizer=sgd --batch_size=$batch_size --num_intra_threads=$intra_op --data_format=NCHW \
--model=$MODEL --variable_update horovod --horovod_device cpu --data_dir <path-to-real-dataset> \
--data_name <dataset_name>

For one MPI process per node, the configuration is as follows; the other environment variables remain the same.

export intra_op=38
export batch_size=128 

mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS --bind-to none --report-bindings \
-n 1 python $python_script --mkl=True --forward_only=False --num_batches=200 \
--kmp_blocktime=0 --num_warmup_batches=50 --num_inter_threads=$inter_op \
--distortions=False --optimizer=sgd --batch_size=$batch_size \
--num_intra_threads=$intra_op --data_format=NCHW --model=$MODEL \
--variable_update horovod --horovod_device cpu --data_dir <path-to-real-dataset> \
--data_name <dataset_name>

Please note that if you want to train models to good accuracy, use --distortions=True. You may also need to change other hyper-parameters.

For running the models on a multi-node cluster, we use a very similar run script. For example, to run on 64 nodes (two MPI processes per node), where each node is an Intel Xeon Gold 6148 system, distributed training can be launched as shown below. All the exports are the same as above.

mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -cpus-per-proc 20 --map-by node \
--report-bindings -hostfile host_names -n 128 python $python_script --mkl=True \
--forward_only=False --num_batches=200 --kmp_blocktime=0 --num_warmup_batches=50 \
--num_inter_threads=$inter_op --distortions=False --optimizer=sgd --batch_size=$batch_size \
--num_intra_threads=$intra_op --data_format=NCHW --model=$MODEL --variable_update horovod \
--horovod_device cpu --data_dir <path-to-real-dataset> --data_name <dataset_name>

Here, host_names is a file listing the hosts on which you wish to run the workload, in standard OpenMPI hostfile format (one hostname per line, optionally followed by a slot count such as slots=2).

What Distributed TensorFlow means for Deep Learning Training on Intel Xeon

Various efforts have been made to implement distributed TensorFlow on CPUs and GPUs, for example gRPC, VERBS, and TensorFlow's built-in MPI support; all of these technologies are incorporated within the TensorFlow codebase. Uber Horovod is one distributed-TensorFlow technology that was able to harness the power of Intel Xeon. It uses MPI underneath and performs ring-based allreduce and allgather on the deep learning parameters. As shown above, Horovod on Intel Xeon shows excellent scaling for existing DL benchmark models, such as ResNet-50 (up to 94%) and Inception-V3 (up to 89%) at 64 nodes. In other words, time to train a DL network can be accelerated by as much as 57x (ResNet-50) and 58x (Inception-V3) using 64 Xeon nodes compared to a single Xeon node. Intel therefore currently recommends that TensorFlow users use Intel-optimized TensorFlow and Horovod with MPI for multi-node training on Intel® Xeon® Scalable processors.
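To make the ring allreduce idea above concrete, here is a conceptual, single-process simulation of the pattern. It is illustrative only and not Horovod's actual implementation, which performs the sends and receives of each step concurrently across workers over MPI; here the "workers" are just equal-length NumPy arrays reduced in rank order.

import numpy as np

def ring_allreduce(buffers):
    """Sum the per-worker gradient buffers so every worker ends with the total."""
    n = len(buffers)
    # Each worker's buffer is split into n chunks; chunk i is reduced around the
    # ring and then the fully reduced chunks are circulated back to every worker.
    chunks = [np.array_split(b.astype(float), n) for b in buffers]

    # Scatter-reduce: after n-1 steps, worker (i+1) % n holds the full sum of chunk i.
    for step in range(n - 1):
        for rank in range(n):
            idx = (rank - step) % n          # chunk this worker forwards at this step
            nxt = (rank + 1) % n             # its neighbor in the ring
            chunks[nxt][idx] += chunks[rank][idx]

    # Allgather: circulate the fully reduced chunks so every worker has all of them.
    for step in range(n - 1):
        for rank in range(n):
            idx = (rank + 1 - step) % n
            nxt = (rank + 1) % n
            chunks[nxt][idx] = chunks[rank][idx].copy()

    return [np.concatenate(c) for c in chunks]

# Example: 4 workers, each holding a different (same-length) gradient vector.
grads = [np.arange(8, dtype=float) * (r + 1) for r in range(4)]
reduced = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in reduced)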

Acknowledgments

The authors would like to thank Vikram Saletore, Mikhail Smorkalov, and Srinivas Sridharan for their collaboration with us on using Horovod and Horovod-based Intel MLSL with TensorFlow.


Notices & Disclaimers

Intel® technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit https://www.intel.com/benchmarks.

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #201108

Intel does not control or audit third-party benchmark data or the websites referenced in this document. You should visit the referenced website and confirm whether referenced data are accurate.

© Intel Corporation. Intel, the Intel® logo, Xeon and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

The benchmark results may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user’s components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations.

[1] Refer to https://github.com/01org/mkl-dnn for more details on Intel® MKL-DNN optimized primitives

System configuration: CPU: Intel Xeon Gold 6148 CPU @ 2.40GHz; OS: Red Hat Enterprise Linux Server release 7.4 (Maipo); TensorFlow Source Code: https://github.com/tensorflow/tensorflow; TensorFlow Commit ID: 024aecf414941e11eb643e29ceed3e1c47a115ad. Detailed configuration is as follows:

CPU
 Thread(s) per core:    2
 Core(s) per socket:    20
 Socket(s):             2
 NUMA node(s):          2
 CPU family:            6
 Model:                 85
 Model name:            Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
 Stepping:              4
 HyperThreading:        ON
 Turbo:                 ON
Memory
 192GB (12 x 16GB)
 2666MT/s
Disks
 Intel RS3WC080 x 3 (800GB, 1.6TB, 6TB)(?)
BIOS
 SE5C620.86B.00.01.0013.030920180427
OS
 Red Hat Enterprise Linux Server release 7.4 (Maipo)
 Kernel 3.10.0-693.21.1.0.1.el7.knl1.x86_64