neon™ 2.0: Optimized for Intel® Architectures
Jun 28, 2017
neon™ is a deep learning framework created by Nervana Systems, with industry-leading performance on GPUs thanks to its custom assembly kernels and optimized algorithms. Since Nervana joined Intel, we have been working together to bring superior performance to CPU platforms as well. Today, as the result of a great collaboration between the teams, we are excited to announce neon™ 2.0 with optimizations for Intel CPUs using Intel's Math Kernel Library (Intel® MKL).
On an Intel Xeon® processor E5 v4 server platform (code named Broadwell), the optimized implementation provides up to a 98x speedup on popular benchmarks and topologies. For example, GoogLeNet v1 inference throughput is 539 images/sec on the Xeon platform, enabling high-throughput inference with neon on CPUs. neon also demonstrates state-of-the-art CPU performance on topologies such as ResNet-50 (training throughput of 42 images/sec on Xeon systems). Users are also expected to see improved performance on Intel Xeon (code named Skylake) and Intel Xeon Phi (code named Knights Mill) processors coming out later this year. We hope that these optimizations will allow data scientists and machine learning researchers to use readily available CPUs to develop deep learning models.
The Intel® MKL library provides CPU-optimized implementations of widely used primitives such as convolution, pooling, activation functions, and normalization. In contrast to existing vanilla implementations, these MKL primitives exploit the full vectorization and parallelization capabilities of Intel Architecture. Information is also available on other MKL-optimized frameworks, including TensorFlow, MXNet, and Caffe.
We have developed a new neon backend (NervanaMKL) that uses MKL primitives where available. The following neon ops are currently optimized with MKL: 2D direct convolution, Pooling, Relu, BatchNorm, MergeSum, and MergeBroadcast.
To achieve peak performance, MKL primitives require N-dimensional input data to be laid out in specific SIMD-friendly formats. To reduce the burden on neon users, we have built the plumbing required for automatic data layout tracking and conversion into the NervanaMKL backend. We have also rewritten elementwise operations using OpenMP to speed up execution. Together, these optimizations provide a significant performance boost for both training and inference tasks.
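To make the layout idea concrete, here is a minimal NumPy sketch of one common kind of SIMD-friendly format: a channel-blocked layout in which groups of 8 channels are stored contiguously for each pixel. This is an illustration under assumptions, not the NervanaMKL implementation. The function names `nchw_to_blocked` and `blocked_to_nchw` and the block size of 8 are hypothetical choices, and the formats actually used by MKL primitives may differ.

```python
import numpy as np


def nchw_to_blocked(x, block=8):
    """Reorder an (N, C, H, W) tensor into (N, C/block, H, W, block).

    Hypothetical helper: channels are zero-padded up to a multiple of
    `block`, so that each pixel's group of `block` channels is contiguous
    in memory and can be loaded with a single vector instruction.
    """
    n, c, h, w = x.shape
    pad = (-c) % block  # number of zero channels needed to fill the last block
    if pad:
        x = np.pad(x, ((0, 0), (0, pad), (0, 0), (0, 0)), mode="constant")
    num_blocks = (c + pad) // block
    # Split the channel axis into (num_blocks, block), then move the
    # inner block axis to the end so it becomes the fastest-varying one.
    blocked = x.reshape(n, num_blocks, block, h, w).transpose(0, 1, 3, 4, 2)
    return np.ascontiguousarray(blocked)


def blocked_to_nchw(xb, channels):
    """Inverse conversion; `channels` is the original (unpadded) channel count."""
    n, num_blocks, h, w, block = xb.shape
    x = xb.transpose(0, 1, 4, 2, 3).reshape(n, num_blocks * block, h, w)
    return np.ascontiguousarray(x[:, :channels])


# Usage: a batch of 2 RGB images, 4x5 pixels each.
x = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)
xb = nchw_to_blocked(x, block=8)
restored = blocked_to_nchw(xb, channels=3)
```

A backend that tracks which layout each tensor is currently in only needs to call conversions like these at the boundary between an optimized op and one that expects the plain layout, which is why hiding the bookkeeping from the user is attractive.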
We have validated the correctness of the implementation on a variety of models that are provided along with the neon framework. Performance has been optimized for various ImageNet-based models such as AlexNet, GoogLeNet v1, and ResNet. Figure 1 shows the training performance improvement for Convnet-Alexnet, Convnet-GoogLeNet v1, and ResNet-50 (with a real image dataset) with the neon MKL backend on an Intel Xeon system.
Figure 1: Performance improvement with Intel MKL on Intel Xeon processor E5 v4 (codename Broadwell) CPUs
We encourage users to check out MKL-optimized neon v2.0 and try out their favorite models on IA platforms. The DNN component of MKL is provided free of charge and downloaded automatically as part of the neon installation.
Future neon v2.x releases will feature performance optimizations for a broader range of models including GANs and DeepSpeech2. Neon v3.0 will feature Intel® Nervana™ Graph support, enabling multinode training, new models such as ResNet-Inception, SSD, and a wide range of Reinforcement Learning models.
1) Install prerequisites

    sudo apt-get install python-pip python-virtualenv libhdf5-dev libyaml-dev pkg-config

2) Get and install neon v2.0

    git clone https://github.com/NervanaSystems/neon.git
    cd neon
    make

3) Activate virtualenv in neon root directory

    . .venv/bin/activate

4) Run basic neon examples without MKL backend under the neon root directory to get neon baseline performance (-e 1 means running for just 1 epoch)

    python examples/cifar10_conv.py -e 1

5) Run basic neon examples with MKL backend under the neon root directory to get boosted neon performance

    python examples/cifar10_conv.py -b mkl -e 1
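If you want a rough side-by-side number for the two runs in steps 4 and 5, a small wrapper script along the following lines could time them. This is only a sketch: the helper names `neon_command`, `time_command`, and `compare_backends` are hypothetical, it assumes Python 3.5 or later running inside the activated virtualenv, and it measures whole-process wall-clock time, which includes startup and data loading (and, on a first run, any dataset download) rather than pure training time.

```python
import subprocess
import sys
import time


def neon_command(script="examples/cifar10_conv.py", backend=None, epochs=1):
    """Build the command line used in the steps above.

    backend=None omits the -b flag and so uses neon's default backend;
    backend="mkl" adds "-b mkl".
    """
    cmd = [sys.executable, script]
    if backend is not None:
        cmd += ["-b", backend]
    cmd += ["-e", str(epochs)]
    return cmd


def time_command(cmd, cwd="."):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, cwd=cwd, check=True)
    return time.perf_counter() - start


def compare_backends(neon_root="."):
    """Time the baseline and MKL runs; return baseline_time / mkl_time."""
    baseline = time_command(neon_command(), cwd=neon_root)
    mkl = time_command(neon_command(backend="mkl"), cwd=neon_root)
    return baseline / mkl


# Example (run from the neon root directory with the virtualenv active):
#     print("speedup: %.1fx" % compare_backends("."))
```

Running each configuration once beforehand keeps one-time costs out of the timed comparison.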
Figure 2: BDW system configuration
neon™ 2.0 Key Contributors:
Peng Zhang (Development), Wei Wang (Benchmarking and Documentation), Dawn Stone (Validation)
Jayaram Bobba is a senior software engineer in the AI Products Group at Intel. He works on graph compilers and CPU optimizations for Machine Learning frameworks like Tensorflow and neon. He also leads the team developing the CPU backend for Nervana Graph. During his time at Intel, he has contributed to many binary translation projects and HW/SW codesign ranging from microarchitecture enhancements to SW algorithm improvements. Jayaram has a PhD from University of Wisconsin in Computer Architecture.
Peng Zhang is a software engineer in the Software Service Group at Intel. He works on the optimization of Deep Learning frameworks including Neon and Torch. He has a Master’s Degree from Tsinghua University in Control Science and Engineering.
Wei Wang is a software engineer in the AI Products Group at Intel. He works on benchmarking Machine Learning frameworks like Neon and Caffe. He has a PhD from University of Delaware in High-Performance Computing (HPC).
Dawn Stone is a software engineer in the AI Products Group at Intel. She works on validating Machine Learning frameworks including Intel Nervana™ Graph and neon™.