nGraph: A New Open Source Compiler for Deep Learning Systems

We are pleased to announce the open sourcing of nGraph, a framework-neutral Deep Neural Network (DNN) model compiler that can target a variety of devices. With nGraph, data scientists can focus on data science rather than worrying about how to adapt their DNN models to train and run efficiently on different devices. Continue reading below for highlights of our engineering challenges and design decisions, and see GitHub, our documentation, and our SysML paper for additional details.

Figure 1 – nGraph ecosystem.

We currently support TensorFlow*, MXNet*, and neon directly through nGraph. CNTK*, PyTorch*, and Caffe2* are supported indirectly through ONNX. Users can run these frameworks on several devices: Intel Architecture, GPU, and Intel Nervana Neural Network Processor (NNP). Devices and frameworks on our roadmap that are not yet supported are shown faded in the figure.

Why did we build nGraph?

When Deep Learning (DL) frameworks first emerged as the vehicle for running training and inference models, they were designed around kernels optimized for a particular device. As a result, many device details were exposed in the model definitions, which made DL models harder to adapt and port to other, or more advanced, devices.

With the traditional approach, an algorithm developer faces tedious work when moving a model to an upgraded device. Enabling a model to run on a different framework is also problematic: the developer must separate the essence of the model from the performance adjustments made for the device, translate it to similar ops in the new framework, and finally make the necessary changes for the preferred device configuration on the new framework.

We designed the nGraph library to substantially reduce these kinds of engineering complexities. While optimized kernels for DL primitives are provided through the project and via libraries like Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN), there are also several compiler-inspired ways in which performance can be further optimized.

How does it work in practice?

Install the nGraph library and write or compile a framework with the library in order to run training and inference models. Specify nGraph as the framework backend you want to use from the command line on any supported system. Our Intermediate Representation (IR) layer handles all the device abstraction details and lets developers focus on their data science, algorithms and models, rather than on machine code.

At a more granular level of detail:

  • The nGraph core creates a strongly-typed and device-neutral stateless graph representation of computations. Each node, or op, in the graph corresponds to one step in a computation, where each step produces zero or more tensor outputs from zero or more tensor inputs. Our philosophy is that nGraph ops should serve as building blocks for more complex DNN operations found in DL frameworks. This is balanced by the need for efficient compilation and deriving training computations from inference computations.
  • We’ve developed a framework bridge for each supported framework; it acts as an intermediary between the nGraph core and the framework. We currently have bridges for TensorFlow/XLA, MXNet, and ONNX. Since ONNX is only an exchange format, the ONNX bridge is augmented by an execution API.
  • A transformer plays a similar role between the nGraph core and the various devices; transformers handle the device abstraction with a combination of generic and device-specific graph transformations. The result is a function that can be executed from the framework bridge. Transformers also allocate and deallocate tensors, and read and write them, under the direction of the bridge. We currently have transformers for Intel Architecture, Intel NNP, and NVIDIA cuDNN, with additional devices under active development.
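To make the division of labor above more concrete, here is a minimal Python sketch of the idea, not the nGraph API. It builds a typed, stateless graph of ops and hands it to a toy "transformer" that returns an executable function. Every name in it (`TensorType`, `Node`, `parameter`, `add`, `multiply`, `ListTransformer`) is hypothetical and chosen only for illustration, and the "device" is simply Python lists.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class TensorType:
    """Element type and shape of a tensor; checked when the graph is built."""
    dtype: str
    shape: Tuple[int, ...]


@dataclass(frozen=True, eq=False)
class Node:
    """One step of the computation. Nodes are immutable and hold no state."""
    op: str
    inputs: Tuple["Node", ...]
    out_type: TensorType
    name: str = ""


def parameter(name: str, dtype: str, shape: Sequence[int]) -> Node:
    return Node("parameter", (), TensorType(dtype, tuple(shape)), name)


def _elementwise(op: str, a: Node, b: Node) -> Node:
    # Strong typing: mismatched element types or shapes fail at build time.
    if a.out_type != b.out_type:
        raise TypeError(f"{op}: {a.out_type} does not match {b.out_type}")
    return Node(op, (a, b), a.out_type)


def add(a: Node, b: Node) -> Node:
    return _elementwise("add", a, b)


def multiply(a: Node, b: Node) -> Node:
    return _elementwise("multiply", a, b)


class ListTransformer:
    """Hypothetical transformer whose 'device' stores tensors as flat lists."""

    KERNELS = {
        "add": lambda x, y: [p + q for p, q in zip(x, y)],
        "multiply": lambda x, y: [p * q for p, q in zip(x, y)],
    }

    def compile(self, result: Node, params: Sequence[Node]) -> Callable[..., List[float]]:
        # Order the nodes so every op runs after the ops that feed it.
        order: List[Node] = []
        seen = set()

        def visit(node: Node) -> None:
            if id(node) in seen:
                return
            seen.add(id(node))
            for inp in node.inputs:
                visit(inp)
            order.append(node)

        visit(result)

        def run(*args: Sequence[float]) -> List[float]:
            if len(args) != len(params):
                raise TypeError("wrong number of arguments")
            values = {id(p): list(a) for p, a in zip(params, args)}
            for node in order:
                if node.op != "parameter":
                    kernel = self.KERNELS[node.op]
                    values[id(node)] = kernel(*(values[id(i)] for i in node.inputs))
            return values[id(result)]

        return run


# A bridge would build the graph from the framework's model; here we do it by hand.
a = parameter("a", "f32", (3,))
b = parameter("b", "f32", (3,))
c = parameter("c", "f32", (3,))
f = ListTransformer().compile(multiply(add(a, b), c), [a, b, c])
print(f([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [2.0, 2.0, 2.0]))  # [10.0, 14.0, 18.0]
```

The graph itself says nothing about where it runs. Swapping in a different transformer class would retarget the same graph without touching the code that built it, which is the separation the bullets above describe.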

Current Performance

Intel has considerable experience with MKL-DNN optimization of frameworks for Intel Architecture. We build on that previous work, with the added benefit that optimizations developed for a device benefit all frameworks through nGraph. Framework developers continue to perform their own optimization work. For example, the performance of TensorFlow 1.7+/XLA on Intel Architecture is much better than that of TensorFlow 1.3/XLA on Intel Architecture, and this should improve further as more work is put into XLA for Intel Architecture.

We present below initial performance data from multiple frameworks that reflects the optimizations done so far on the Intel Architecture (IA) transformer. On the latest Intel Xeon Platinum 8180 processor, in conjunction with MKL-DNN v0.13, we are able to meet or greatly exceed the performance of previously optimized frameworks such as MXNet-MKLDNN-CPU (MXNet optimized with MKL-DNN) and neon-MKLML-CPU (neon optimized with MKLML). We also deliver better performance than the TensorFlow XLA compiler (TF-XLA-CPU), but many more optimizations can still be made with XLA, both on the default CPU implementation and on nGraph.

Status and Future Work

As of today, nGraph supports six DL frameworks and three compute devices.

Supported frameworks:

  • Direct support through nGraph’s framework-independent representation
    • TensorFlow*
    • MXNet*
    • neon
  • Indirect support through ONNX
    • CNTK*
    • PyTorch*
    • Caffe2*

Supported compute devices:

  • Intel Architecture (x86, Intel® Xeon® and Xeon Phi®)
  • Intel® Nervana™ Neural Network Processor (Intel® Nervana NNP)
  • NVIDIA* cuDNN (in progress)

We will continue to add support for additional devices and more graph optimizations such as device-specific op fusions, better work schedulers and faster custom op kernels.
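As an illustration of what an op fusion is, the following Python sketch rewrites the pattern add(multiply(x, y), z) into a single fused node. It is a generic, simplified rewrite pass under our own assumptions, not nGraph's implementation: the `Node` type, the `fuse_multiply_add` function, and the `fused_multiply_add` op name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True, eq=False)
class Node:
    """Hypothetical graph node: an op name plus its input nodes."""
    op: str
    inputs: Tuple["Node", ...] = ()
    name: str = ""


def fuse_multiply_add(root: Node) -> Node:
    """Return a new graph where add(multiply(x, y), z) becomes one fused op.

    Simplification: a production pass would also check that the multiply has
    no other consumers before fusing it away.
    """
    memo: Dict[int, Node] = {}  # keyed by the identity of the original nodes

    def rewrite(node: Node) -> Node:
        if id(node) in memo:
            return memo[id(node)]
        inputs = tuple(rewrite(i) for i in node.inputs)
        # Leaves are reused as-is; interior nodes are rebuilt on new inputs.
        out = Node(node.op, inputs, node.name) if inputs else node
        if node.op == "add" and len(inputs) == 2:
            for k in (0, 1):
                if inputs[k].op == "multiply":
                    x, y = inputs[k].inputs
                    out = Node("fused_multiply_add", (x, y, inputs[1 - k]))
                    break
        memo[id(node)] = out
        return out

    return rewrite(root)


x = Node("parameter", name="x")
y = Node("parameter", name="y")
z = Node("parameter", name="z")
graph = Node("add", (Node("multiply", (x, y)), z))

fused = fuse_multiply_add(graph)
print(fused.op, [i.name for i in fused.inputs])  # fused_multiply_add ['x', 'y', 'z']
```

A device-specific transformer could apply a pass like this only when its backend offers a matching fused kernel, and leave the graph unchanged otherwise.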

Visit our GitHub repository to learn how to contribute to nGraph.

Configuration Information

CPU Configuration:

Architecture:          x86_64

CPU op-mode(s):        32-bit, 64-bit

Byte Order:            Little Endian

CPU(s):                112

On-line CPU(s) list:   0-111

Thread(s) per core:    2

Core(s) per socket:    28

Socket(s):             2

NUMA node(s):          2

Vendor ID:             GenuineIntel

CPU family:            6

Model:                 85

Model name:            Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz

Stepping:              4

CPU MHz:               1000.585

CPU max MHz:           3800.0000

CPU min MHz:           1000.0000

BogoMIPS:              4989.29

Virtualization:        VT-x

L1d cache:             32K

L1i cache:             32K

L2 cache:              1024K

L3 cache:              39424K

NUMA node0 CPU(s):     0-27,56-83

NUMA node1 CPU(s):     28-55,84-111

OS:

Ubuntu 16.04.3 LTS

Environment Variables:

export OMP_NUM_THREADS=56

export KMP_AFFINITY=granularity=fine,compact,1,0

export LD_LIBRARY_PATH=/path/to/ngraph_dist/lib
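The value of OMP_NUM_THREADS matches the number of physical cores in the CPU listing above, which suggests one OpenMP thread per physical core rather than one per hardware thread. That reading is an inference from the numbers, not something stated in the configuration. The arithmetic, using only values copied from the listing:

```python
# Values copied from the CPU configuration listing.
sockets = 2
cores_per_socket = 28
threads_per_core = 2

physical_cores = sockets * cores_per_socket        # matches OMP_NUM_THREADS=56
logical_cpus = physical_cores * threads_per_core   # matches CPU(s): 112

print(physical_cores, logical_cpus)  # 56 112
```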

MXNet (nGraph and Direct Optimization):

export MXNET_NGRAPH_GRAPH_OPTIMIZATION=1

python example/image-classification/train_cifar10.py --num-layers=8

python example/image-classification/train_cifar10.py --num-layers=20

python example/image-classification/train_imagenet.py --num-layers=50 --data-train=/path/to/train.rec --data-val=/path/to/val.rec

TensorFlow/XLA:

Baseline: TensorFlow 1.3 built with JIT compilation

*Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.

The benchmark results reported above may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user’s components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations.