High-performance TensorFlow* on Intel® Xeon® Using nGraph™

We recently announced the open source release of Intel® nGraph™, a C++ library, compiler, and runtime suite for running Deep Neural Networks on a variety of devices. Today we are pleased to announce availability of simplified bridge code that can be used to link TensorFlow-based projects to pre-optimized nGraph backends. The bridge code implementation delivers up to 10X better performance¹ compared to our initial TensorFlow integration.

Setup is simple (we’ve provided instructions in our bridge code repo): build TensorFlow as you normally would and select nGraph as a TensorFlow device, then modify a few lines of Python code in existing TensorFlow-based DL models to target nGraph. At present, nGraph supports workloads on Intel x86 processors and will support Intel® Nervana™ Neural Network Processors (Intel® Nervana™ NNPs) when available. Future support for NVIDIA GPUs is also in the works.
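For example, targeting nGraph from an existing TensorFlow script typically amounts to placing the computation under an explicit device scope. The snippet below is a minimal sketch only: it assumes the bridge has been built and registers a device named "NGRAPH" (see "How does it work?" below), so the exact device string and session options may differ in your setup.

    import tensorflow as tf

    # Minimal sketch: run a small computation on the nGraph device.
    # Assumes the nGraph bridge is installed and registers a device named
    # "NGRAPH"; allow_soft_placement lets TensorFlow fall back to the default
    # CPU device for any ops the nGraph device does not support.
    config = tf.ConfigProto(allow_soft_placement=True)

    with tf.device('/device:NGRAPH:0'):
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
        c = tf.matmul(a, b)

    with tf.Session(config=config) as sess:
        print(sess.run(c))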

nGraph performance on Intel Xeon processors

The new nGraph TensorFlow bridge provides substantially improved performance over the initial version open-sourced last month. As illustrated in Figure 1, using nGraph yields significant training performance improvements relative to the XLA CPU compiler. While the XLA CPU implementation is still experimental and has significant room for improvement, our measurements show that nGraph also compares very favorably to the state of the art in IA optimization for TensorFlow.

Figure 2 compares the MKL-DNN-optimized TensorFlow implementation, which applies MKL-DNN optimizations directly in TensorFlow’s native compute engine, to the nGraph TensorFlow bridge implementation. Incorporating nGraph yields better training performance on models like ResNet50 with the ImageNet dataset (68.9 images/sec with nGraph vs. 64 images/sec with TF-MKL-DNN, an improvement of around 7 percent¹). The results also indicate notable performance improvements on CIFAR10 ResNet models; we note, however, that much of this gap is attributable to a known issue with overhead in the initialization of MKL-DNN primitives in the baseline TF-MKL-DNN implementation. A fix for that issue is being upstreamed to TensorFlow. Taken together, these results demonstrate that the nGraph TensorFlow integration incurs little overhead and can achieve state-of-the-art IA performance for TensorFlow.

Figure 1: Training speed using TensorFlow-XLA (2S Intel(R) Xeon(R) Platinum 8180 CPU) yields 68.9 img/sec at peak performance on ResNet50-I1k with nGraph and 6.5 img/sec without.

Figure 2: Training speed of ResNet50-I1k using TensorFlow-XLA-nGraph (2S Intel(R) Xeon(R) Platinum 8180 CPU) yields 68.9 img/sec at peak performance with nGraph compared to 64 img/sec using MKL-DNN-optimized TensorFlow.

How we did it

Performance optimizations in the nGraph CPU backend

The nGraph CPU backend, sometimes called the “IA transformer”, implements a range of optimizations to deliver the best possible performance for a given model on Intel CPU platforms:

  • Framework-independent optimizations: We leverage optimized kernel libraries like MKL-DNN and Eigen for fast DNN kernels. Additionally, we incorporate graph-optimization passes that choose optimal data layouts for these kernel implementations and reduce overall data layout conversions at the graph level. We also fuse operations like BatchNorm and ReLU to better exploit fused kernel implementations that have lower memory requirements.
  • TF/XLA-specific optimizations: TensorFlow has certain API-level quirks that lead to the addition of extra Pad ops in models. These ops can cause unnecessary data layout conversions and can be avoided with the padded convolution kernels available through MKL-DNN. We exploit this to fuse these Pad ops into existing convolution operators (see the sketch below), which gives a nice performance boost relative to a base TensorFlow model. In addition, XLA adds a few identity operators, such as type conversions from Float32 to Float32, that can be eliminated from the graph without impacting correctness.
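The Pad-op fusion mentioned above can be illustrated with a toy TensorFlow graph. This is a sketch for illustration only: the fusion is performed by the bridge’s graph passes rather than by user code, and the tensor shapes below are arbitrary.

    import tensorflow as tf

    # TensorFlow (and tf2xla) often emit this two-op pattern:
    x = tf.placeholder(tf.float32, [1, 224, 224, 3])
    w = tf.get_variable("w", [3, 3, 3, 64])

    padded = tf.pad(x, [[0, 0], [1, 1], [1, 1], [0, 0]])  # explicit Pad op
    y_unfused = tf.nn.conv2d(padded, w, strides=[1, 1, 1, 1], padding="VALID")

    # ...which is numerically equivalent to a single padded convolution,
    # the form that MKL-DNN can execute directly, without the extra Pad op
    # and its associated data layout conversions:
    y_fused = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME")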


Fusing graph operations via pattern matching

The nGraph IR contains built-in primitives for many common deep learning operations, including convolution, padding, and batch normalization. These primitives were chosen because they can be mapped relatively easily onto the highly optimized compute kernels provided by backend libraries such as Intel MKL-DNN.

TensorFlow’s tf2xla translator decomposes some high-level TensorFlow operations into graphs defined in terms of low-level tensor operations. One frequently occurring case is TensorFlow’s average pooling operation. If the input tensor is padded, the operation is rewritten by tf2xla to a subgraph containing:

  1. a reduce-window operation to sum sliding windows of the input tensor;
  2. a second reduce-window operation to compute the divisors to apply to each summed window; and
  3. an element-wise division operation.

While each of these operations is directly supported in nGraph, they will not be nearly as performant as nGraph’s AvgPool primitive, which maps directly to an optimized implementation provided by MKL-DNN. A similar situation arises for max pooling, convolution backpropagation, and a number of other important primitives.
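To make the decomposition concrete, here is a small NumPy sketch of the three-step subgraph for a padded 1-D average pool (window 3, stride 1, one element of zero padding per side). It only illustrates what tf2xla emits conceptually; the real operations are HLO reduce-window and divide instructions over N-dimensional tensors.

    import numpy as np

    x = np.array([1., 2., 3., 4., 5.])
    pad, window = 1, 3

    padded = np.pad(x, pad)              # zero-padded input
    ones = np.pad(np.ones_like(x), pad)  # 1 where real data, 0 where padding

    # Step 1: reduce-window (sum) over the padded input.
    window_sums = np.array([padded[i:i + window].sum() for i in range(len(x))])

    # Step 2: a second reduce-window computes each window's divisor,
    # i.e. how many real (non-padding) elements it covers.
    divisors = np.array([ones[i:i + window].sum() for i in range(len(x))])

    # Step 3: element-wise division yields the average-pool result.
    avg_pool = window_sums / divisors
    print(avg_pool)  # [1.5 2.  3.  4.  4.5]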

In order to address this, we implement an HLO fusion pass inside the TensorFlow-to-nGraph bridge. This fusion pass, illustrated in Figure 3 below, complements a number of framework-independent optimizations implemented within the nGraph CPU backend itself. When HLO fusion is incorporated, the translation from HLO to nGraph takes place in three steps. First, the raw graph (Figure 3a) generated by tf2xla is searched for patterns corresponding to high-level operations (indicated by orange nodes in Figure 3b). Second, each subgraph identified for fusion is wrapped in an HLO fusion instruction (Figure 3c). Finally, the post-fusion HLO graph is translated to an nGraph graph (Figure 3d), with fusion instructions being mapped to high-level nGraph operations.

Figure 3: Illustration of nGraph graph generation.

The fusion pass currently recognizes convolution back-propagation, forward- and back-propagation for average and max pooling, forward- and back-propagation for ReLU, and reduction operations like sum, product, and max. (Note that some other high-level operations, including batch normalization, are already treated as primitive in HLO, so there is no need to fuse them.)
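Conceptually, the fusion pass is a pattern matcher over the HLO graph. The Python sketch below shows the idea on a toy graph representation; the dict-based nodes, op names, and rewrite logic are hypothetical stand-ins for the real HLO instruction classes and fusion-instruction machinery.

    # Toy graph: each node is a dict with an "op" name and a list of input nodes.
    # The op names here are hypothetical stand-ins for HLO instructions.

    def match_avg_pool(node):
        """Return the (window_sums, divisors) inputs if `node` is the root of
        the decomposed average-pool pattern (a divide of two reduce-window sums)."""
        if node["op"] != "divide":
            return None
        lhs, rhs = node["inputs"]
        if lhs["op"] == "reduce-window-sum" and rhs["op"] == "reduce-window-sum":
            return lhs, rhs
        return None

    def fuse_avg_pool(graph):
        """Rewrite each matched subgraph into a single fused node, which the
        bridge later maps to nGraph's AvgPool (and, in turn, to MKL-DNN)."""
        for node in graph:
            match = match_avg_pool(node)
            if match is not None:
                window_sums, _divisors = match
                node["op"] = "fused-avg-pool"
                node["inputs"] = list(window_sums["inputs"])  # original (padded) input
        return graph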

How does it work?

TensorFlow’s XLA framework provides a mechanism for a “backend” device to register to receive computation graphs expressed in HLO format and to return an executable object capable of executing the computation at runtime. We developed an XLA plugin that registers as the “NGRAPH” device and compiles and executes HLO computations.

A dynamically-loadable plugin framework

Currently, XLA plugin device source code needs to reside in the TensorFlow source tree and must be built along with the rest of TensorFlow. Adding a new device requires understanding the TensorFlow code base and its build system, both of which are fairly complicated. Moreover, upstreaming and maintaining the new code is difficult, and sometimes undesirable, since every change to the plugin implementation needs to be reviewed by the TensorFlow team, even when it is not relevant to TensorFlow itself.

The dynamically loadable XLA plugin we developed, in which the actual XLA plugin source code resides outside of the TensorFlow source tree, is built into a Dynamic Shared Object (DSO) library with many of the nGraph optimizations built in. On the XLA side, we created a plugin adapter that loads the plugin DSO and registers the XLA device using attributes supplied by the plugin DSO.

Graph compilation and execution

From TensorFlow* computation graph to nGraph

TensorFlow* uses XLA to create an HLO computation graph. The HLO intermediate representation (IR) is then handed over to the nGraph plugin DSO. The source for the bridge to nGraph resides in a separate GitHub repository named ngraph-tensorflow-bridge.

When workloads are deployed to target nGraph devices, the ngraph-tensorflow-bridge transforms the HLO IR into nGraph IR, which is then compiled to produce executable code for a specific device. This executable is returned to TensorFlow for subsequent execution of the computation.


Figure 4: Transformation of TensorFlow graph to nGraph IR.


¹Configuration details
Hardware configuration: 2S Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz (28 cores), HT enabled, turbo enabled, scaling governor set to “performance” via intel_pstate driver, 384GB (12 * 32GB) DDR4 ECC SDRAM RDIMM @ 2666MHz (Micron* part no. 36ASF4G72PZ-2G6D1), 800GB SSD 2.5in SATA 3.0 6Gb/s Intel Downieville SSDSC2BB800G701 DC S3520, client ethernet adapter: Intel PCH Integrated 10 Gigabit Ethernet Controller
Software configuration: Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-109-generic x86_64). Datasets were hosted on NFS storage.
Software Release Versions:
  1. ngraph-tensorflow, Commit- c2cc26b
  2. ngraph-tensorflow-bridge, Commit- f9b9e5a
  3. ngraph, Commit- eec1922

Scripts: https://github.com/NervanaSystems/ngraph-tensorflow-bridge/tree/master/test/resnet

Command lines

ResNet CIFAR10: KMP_BLOCKTIME=1 OMP_NUM_THREADS=56 KMP_AFFINITY=granularity=fine,compact,1,0 python cifar10_main.py --data_dir /path/to/dataset --model_dir /path/to/saved_model/ --batch_size 128 --resnet_size $RESNET_SIZE --data_format channels_first --inter_op 2 --select_device NGRAPH
# (RESNET_SIZE was tested at 8, 20, 56) 

ResNet ImageNet (I1k): KMP_BLOCKTIME=0 OMP_NUM_THREADS=56 KMP_AFFINITY=granularity=fine,compact,1,0 python imagenet_main.py --data_dir /path/to/dataset --model_dir /path/to/saved_model/ --batch_size 128 --resnet_size 50 --data_format channels_first --inter_op 2 --select_device NGRAPH