
Jason Knight

Senior Technology Officer, Artificial Intelligence Products Group

We are building the Intel Nervana Graph project to be the LLVM of deep learning, and today we are excited to announce a beta release of the work we first shared in a technical preview. We see the Intel Nervana Graph project as the beginning of an ecosystem of optimization passes, hardware backends, and frontend connectors to popular deep learning frameworks. We also see its set of independent modules and utilities (serialization, visualization, automatic differentiation) as reusable components for building the frameworks of the future.

Since our technical preview release late last year, we have been working on enabling bigger models, higher performance, and more frontend framework connectors. Let’s look at these improvements one at a time, starting with the frontends:

TensorFlow/XLA Support

With the experimental release of XLA and associated APIs earlier this year, we’ve jumped on the chance to integrate Intel Nervana Graph’s backends and optimization pass infrastructure into TensorFlow at a deeper level. This replaces our previous out-of-band TensorFlow model conversion routine with a seamless user experience that allows most existing TensorFlow scripts and utilities (including TensorBoard) to work out of the box without modification.

The connection between the XLA and Intel Nervana Graph APIs was quite straightforward, given both projects’ intent to provide a compact and explicit intermediate representation.

While today the XLA/Intel Nervana Graph integration is at a pre-alpha level, we’d love for people to take it for a spin and kick the tires. We’re working on ironing out known performance issues and improving op and backend support.


Caffe Support

We now have preliminary support for Caffe models, including command line emulation and a Python converter.

Distributed Training Support

The heterogeneous transformer now supports training across multiple CPUs or GPUs, and this support will eventually expand to encompass multiple hosts as well. We’ll share additional technical details in an upcoming blog post.
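The most common form of multi-device training is synchronous data parallelism: each worker computes gradients on its own shard of the minibatch, the gradients are averaged, and the shared weights are updated once. The sketch below illustrates that idea for a scalar model y = w*x; the names and structure are hypothetical simplifications, not the heterogeneous transformer's actual mechanism.

```python
# Synchronous data-parallel SGD, illustrated for the model y = w * x
# with mean-squared-error loss. Hypothetical sketch, not Intel Nervana
# Graph code.

def sgd_step_data_parallel(w, shards, lr=0.1):
    """One synchronous update: each shard plays the role of one worker."""
    grads = []
    for shard in shards:
        # Gradient of mean((w*x - y)^2) with respect to w on this shard.
        g = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
        grads.append(g)
    g_avg = sum(grads) / len(grads)   # the "all-reduce" (average) step
    return w - lr * g_avg

# Two workers, each with one (x, y) example consistent with w = 2.
shards = [[(1.0, 2.0)], [(2.0, 4.0)]]
w = 0.0
for _ in range(100):
    w = sgd_step_data_parallel(w, shards)
```

Because the gradients are averaged before the update, the result is equivalent to running a single worker on the full minibatch, which is what makes this scheme attractive for scaling out.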

Memory Optimizations

Moving to hardware-independent optimizations: using liveness analysis on the dataflow graphs produced by frontends such as neon and TensorFlow, we can determine which temporary arrays are no longer needed and reclaim their memory for later computations. This results in 5 to 6 times less memory usage across a range of smaller test models.
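The effect of liveness-based reclamation can be seen with a toy scheduler over a straight-line op list: once the last reader of a buffer has executed, its memory no longer counts toward the peak. This is a hypothetical illustration of the idea, not the actual Intel Nervana Graph pass.

```python
# Toy liveness analysis: compute peak memory of a linear schedule,
# with and without freeing buffers after their last use.
# Hypothetical sketch, not Intel Nervana Graph internals.

def peak_memory(schedule, reuse=True):
    """schedule: list of (output_name, size, input_names).
    Returns the peak number of bytes live at any point."""
    # Liveness analysis: last index at which each buffer is read.
    last_use = {}
    for i, (_, _, inputs) in enumerate(schedule):
        for name in inputs:
            last_use[name] = i

    live, peak = {}, 0
    for i, (out, size, inputs) in enumerate(schedule):
        live[out] = size
        peak = max(peak, sum(live.values()))
        if reuse:
            # Free every buffer whose last reader has now run.
            # (A buffer never read again, e.g. the final output, stays.)
            for name in list(live):
                if name != out and last_use.get(name, len(schedule)) <= i:
                    del live[name]
    return peak

# A chain a -> b -> c -> d of four equal-sized temporaries: without
# reclamation all four stay live; with it, at most two do.
chain = [("a", 4, []), ("b", 4, ["a"]), ("c", 4, ["b"]), ("d", 4, ["c"])]
```

For long chains of equal-sized temporaries the saving approaches the chain length, which is consistent in spirit with the 5–6x reductions reported above for small models.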

In the view above, the green bar represents the amount of memory used directly by the operation in question, and the red bar represents the amount of memory that is still live at this point in the computation. This feature will be landing soon, and there are more improvements still to come.

Leveraging MKL-DNN

On the backend side of things, we have made considerable progress in utilizing Intel’s open source MKL-DNN library to greatly accelerate deep learning computations on Intel processors. So far we’ve completed integration of MKL-DNN APIs into the ngraph CPU backend for Relu, Pooling, BatchNormalization and element-wise additions. In addition, we are working on a full-graph tensor layout optimization to get the best performance from MKL-DNN’s hand-tuned kernels.
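To make the layout idea concrete: MKL-DNN's vectorized kernels prefer blocked layouts such as nChw8c, where channels are grouped into blocks of 8 contiguous elements. The toy helper below maps a plain NCHW coordinate to its offset in such a blocked layout; the helper name is hypothetical, and the real pass operates on whole graphs rather than single indices.

```python
# Toy illustration of a blocked tensor layout (nChw8c-style): channels
# are split into blocks of 8 so a vector unit can load one block at a
# time. Hypothetical helper, not the actual MKL-DNN integration code.

BLOCK = 8

def nchw_to_blocked(n, c, h, w, C, H, W):
    """Flat offset of coordinate (n, c, h, w) in nChw8c layout,
    for a tensor with C channels and H x W spatial dimensions."""
    cb, ci = divmod(c, BLOCK)              # channel block, index in block
    blocks = (C + BLOCK - 1) // BLOCK      # number of channel blocks
    return (((n * blocks + cb) * H + h) * W + w) * BLOCK + ci
```

A full-graph pass tries to keep tensors in this blocked form across consecutive MKL-DNN ops, inserting conversions only at the boundaries, so the reorder cost is paid once rather than per operation.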

This effort also included implementing a new pattern matching toolkit that allows optimization passes to recognize arbitrarily complex computational patterns (think regular expressions for graphs). This pattern matcher was then used to detect operator compounding (aka “fusion”) opportunities, starting with matrix multiplication followed by bias addition, which MKL-DNN can execute with increased performance.
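A minimal version of this kind of fusion pass can be sketched as a bottom-up rewrite over an expression tree: wherever an Add consumes a Dot, the pair is replaced with a single fused op. All names here (Op, DotBias, fuse_dot_bias) are hypothetical, not the actual Intel Nervana Graph pattern matcher.

```python
# Toy fusion pass: rewrite Add(Dot(x, w), b) into DotBias(x, w, b).
# Hypothetical sketch of the pattern-matching idea described above.

class Op:
    def __init__(self, kind, *inputs):
        self.kind = kind
        self.inputs = list(inputs)

def fuse_dot_bias(op):
    """Bottom-up rewrite of the expression tree rooted at `op`."""
    op.inputs = [fuse_dot_bias(i) for i in op.inputs]
    if op.kind == "Add" and len(op.inputs) == 2:
        a, b = op.inputs
        if a.kind == "Dot":
            return Op("DotBias", *a.inputs, b)
        if b.kind == "Dot":
            return Op("DotBias", *b.inputs, a)
    return op

# Relu(Add(Dot(X, W), b))  ->  Relu(DotBias(X, W, b))
g = fuse_dot_bias(Op("Relu", Op("Add", Op("Dot", Op("X"), Op("W")), Op("b"))))
```

A real matcher generalizes this by letting passes declare patterns declaratively (with wildcards and repetition, hence the regular-expression analogy) instead of hand-coding each rewrite.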

Expect more announcements from this integration as we begin our benchmarking process and continue the optimization efforts on Intel architectures.

neon 3.0

In addition to the Intel Nervana Graph improvements highlighted above, we are also releasing a technical preview of neon 3.0. Neon is Intel’s reference implementation of a deep learning framework, designed for high performance and high productivity. Paired with a library of recent topologies available for easy reuse in our Model Zoo, it enables users to get up and running quickly on their data science problems.

The neon 3.0 technical preview includes more layer types, more pre-assembled models, and more APIs that make new layers and new models even easier to build. Two examples of new layer types: we now include the Connectionist Temporal Classification (CTC) cost function as implemented in Baidu’s warp-ctc, and an abstraction around RNN layers that makes it easy for anyone to implement recurrent layers with arbitrary internal computations and minimal boilerplate.
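The essence of such an RNN abstraction is that the framework owns the recurrence loop and the user supplies only a per-step function. The sketch below shows the shape of that idea with hypothetical names; it is not neon's actual API.

```python
# Sketch of a generic recurrence driver: the framework unrolls the
# loop, the user supplies only the per-step computation.
# Hypothetical names, not neon's RNN abstraction itself.

def unroll(step_fn, inputs, h0):
    """Apply step_fn(h, x) across a sequence, returning all states."""
    h, states = h0, []
    for x in inputs:
        h = step_fn(h, x)
        states.append(h)
    return states

# A trivially simple "recurrent cell": an exponential moving average.
ema = lambda h, x: 0.9 * h + 0.1 * x
states = unroll(ema, [1.0, 1.0, 1.0], h0=0.0)
```

Swapping in a different step function (an LSTM cell, a GRU cell, or something custom) changes the layer's behavior without touching the loop, which is the boilerplate reduction described above.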

We added several models to demonstrate recent topologies and the usage of these new features: residual networks on the CIFAR-10 dataset, deep convolutional generative adversarial networks (DCGAN) on MNIST, a speech model with bidirectional RNNs and CTC on Librispeech, an LSTM network for IMDB movie review sentiment analysis, a character-based seq2seq model, and a deep Q-network (DQN) playing an Atari game.

And finally, we have introduced the first pieces of a query selector API in neon that lets users easily query and manipulate operators in the Intel Nervana computation graph underlying the higher-level model objects. A user can build up a model by specifying a series of layers, then query for “all trainable tensors in convolution layers which are followed by pooling” or “all element-wise operators not in convolutional layers”, and use the results to modify the attributes of those Intel Nervana Graph operators or attach debug operators. The selector API is still at a very early stage, but we are excited about its potential, so please feel free to give us feedback on the design and intended use cases.
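The flavor of such a selector API can be sketched with a flat list of operator records and a small filter function; the attribute names and query style below are hypothetical, not the real selector syntax.

```python
# Minimal sketch of querying operators by attribute, in the spirit of
# the selector API described above. Hypothetical structure and names.

ops = [
    {"name": "conv1/W",   "kind": "variable",    "layer": "conv", "trainable": True},
    {"name": "conv1/out", "kind": "conv2d",      "layer": "conv", "trainable": False},
    {"name": "relu1",     "kind": "elementwise", "layer": "conv", "trainable": False},
    {"name": "fc/W",      "kind": "variable",    "layer": "fc",   "trainable": True},
    {"name": "add2",      "kind": "elementwise", "layer": "fc",   "trainable": False},
]

def select(ops, **criteria):
    """Return the ops matching every keyword criterion."""
    return [op for op in ops
            if all(op.get(k) == v for k, v in criteria.items())]

# "all trainable tensors in convolution layers"
conv_weights = select(ops, layer="conv", trainable=True)
# "all element-wise operators not in convolutional layers"
ew_outside_conv = [op for op in select(ops, kind="elementwise")
                   if op["layer"] != "conv"]
```

With the matched operators in hand, a user could then set attributes or attach debug operators, as described above.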

Join us

Join us by making pull requests, suggestions, and comments on GitHub. We are also hiring for full-time and internship positions.

