neon v2.3.0: Significant Performance Boost for Deep Speech 2 and VGG models

Nov 14, 2017

Wei Wang

Deep Learning Software Engineer, Artificial Intelligence Products Group

Peng Zhang

Software Engineer, Software Service Group

Jayaram Bobba

Principal Engineer, Artificial Intelligence Products Group

We are excited to announce the release of neon™ 2.3.0. It ships with significant performance improvements for Deep Speech 2 (DS2) and VGG models running on Intel® architecture (IA). For the DS2 model, our tests show up to 6.8X improvement1,4 with the Intel® Math Kernel Library (Intel® MKL) backend over the NumPy CPU backend with neon™ 2.3.0, and more than 2X improvement2,4 over neon™ 2.2.0. For the VGG-16 model, we observed up to 2.8X3,4 performance improvement over neon™ 2.2.0.
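
If you want to try the comparison yourself, backend selection in neon happens when the backend is generated. The snippet below is a minimal sketch that assumes neon 2.3.0 is installed and uses its gen_backend API with the 'mkl' and 'cpu' backend names:

from neon.backends import gen_backend

# Minimal sketch: generate the optimized Intel MKL backend.
# Swap backend='cpu' to fall back to the NumPy CPU backend that serves
# as the baseline in the comparison above.
be = gen_backend(backend='mkl', batch_size=32)
print(be)

Most of the model scripts that ship with neon also expose this choice through the -b command-line flag (for example, -b mkl).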

To improve DS2 performance, we added data layout optimizations that make memory access patterns more compatible with Intel® MKL (version: mklml_lnx_2018.0.20170908). We also added optimized C/OpenMP* kernels that replace the previous vanilla Python implementations. Together, these optimizations boosted neon™ 2.3.0 DS2 performance on IA by as much as 6.8X compared to neon's NumPy CPU backend (see Figure 1).

 

Figure 1: Training performance comparison of neon Deep Speech 2 with v2.2.0 and v2.3.0 for the NumPy CPU backend and the faster Intel® MKL backend running on a dual-socket Intel® Xeon® Platinum 8180 platform.
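
To give a concrete sense of what the data layout work avoids, here is a plain NumPy sketch of converting an activation tensor between a feature-major (C*H*W, N) layout, roughly how neon lays out activations on its CPU backend, and the NCHW-style layout that Intel® MKL convolution kernels prefer. The shapes are arbitrary and the code is illustrative only, not neon internals:

import numpy as np

# Illustrative only: the kind of whole-tensor copy that each layout
# conversion costs. Activations stored feature-major as a (C*H*W, N) matrix.
C, H, W, N = 64, 56, 56, 32
chw_n = np.random.rand(C * H * W, N).astype(np.float32)

# Reinterpret as (C, H, W, N), then move the batch axis to the front: NCHW.
nchw = np.ascontiguousarray(chw_n.reshape(C, H, W, N).transpose(3, 0, 1, 2))

# Converting back after every layer would copy the whole tensor each time;
# keeping data in an MKL-friendly layout end to end removes that cost.
chw_n_back = np.ascontiguousarray(nchw.transpose(1, 2, 3, 0)).reshape(C * H * W, N)
assert np.allclose(chw_n, chw_n_back)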

For the VGG-16 and AlexNet models, we improved training throughput by more than 2X. We added an optimization that fuses the convolution and bias layers in neon, minimizing (or eliminating) costly tensor layout conversions between layers. This benefits models like VGG-16 and AlexNet, both of which are built from convolutional layers with bias. Figure 2 shows the training performance of AlexNet and VGG-16 with the MKL backend in neon v2.2 and neon v2.3 on a dual-socket Intel® Xeon® Platinum 8168 platform. By fusing the two layers, we maximized the time that code executes in accelerated Intel® MKL mode, minimizing the overall time it takes to run the models. As a result, performance improved by up to 2.1X4 for AlexNet and up to 2.8X4 for VGG-16.

Figure 2: The training performance of AlexNet and VGG using the Intel® MKL backend of neon v2.2.0 and v2.3.0 on a dual-socket Intel® Xeon® Platinum 8168 platform. ImageNet dataset (1000 classes) was used.  
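
As a rough illustration of why the fusion pays off, the NumPy sketch below contrasts an unfused path, where a standalone bias layer forces a layout conversion on either side of it plus an extra pass over the tensor, with a fused path that adds the bias while the convolution output is still in the MKL-friendly layout. The shapes and conversion helpers are illustrative, not neon's actual internals:

import numpy as np

# Stand-in for the output of an MKL convolution, in NCHW layout.
C_out, N, H, W = 64, 32, 28, 28
conv_out_nchw = np.random.rand(N, C_out, H, W).astype(np.float32)
bias = np.random.rand(C_out).astype(np.float32)

def to_neon_layout(t):
    # (N, C, H, W) -> (C*H*W, N): one full-tensor copy.
    return np.ascontiguousarray(t.transpose(1, 2, 3, 0)).reshape(C_out * H * W, N)

def to_mkl_layout(t):
    # (C*H*W, N) -> (N, C, H, W): another full-tensor copy.
    return np.ascontiguousarray(t.reshape(C_out, H, W, N).transpose(3, 0, 1, 2))

# Unfused: a separate bias layer forces conversions before and after it.
y_unfused = to_mkl_layout(to_neon_layout(conv_out_nchw) + np.repeat(bias, H * W)[:, None])

# Fused: the bias is added while the output is still in the MKL-friendly
# layout, so both conversions and the extra pass disappear.
y_fused = conv_out_nchw + bias[None, :, None, None]
assert np.allclose(y_unfused, y_fused)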

We’re continuing to improve the performance of different models, especially SSD and GAN models. Please stay tuned for improved neon performance on Intel architectures!

The authors would like to acknowledge Hanlin Tang’s help in realizing the fusion of convolution and bias across all three backends in neon, Baojun Liu’s validation effort, and Wadim Sokolowski’s help in generating the performance data for AlexNet and VGG-16.

 

1 Configuration details: 2S Intel® Xeon® Platinum 8180 CPU @ 2.5GHz (28 cores), HT enabled, turbo on, 192GB ECC RAM, CentOS Linux release 7.3, Linux kernel 3.10.0-514.el7.x86_64, Intel SSD 400GB. neon™ 2.3.0 (https://github.com/NervanaSystems/neon/commits/v2.3.0) commit id 9eb09d79cb65c4854db9f5d0140ad7a8e247359e and DS2 (https://github.com/NervanaSystems/deepspeech) commit id e55159d56569ff16c75369141a42fd5589d83279 were used. A DS2 batch size of 32 and a speech sample length of 30 seconds were used. For the Intel® MKL backend, the KMP_AFFINITY and OMP_NUM_THREADS settings were “export KMP_AFFINITY=compact,0,1,verbose” and “export OMP_NUM_THREADS=56”. For the NumPy backend, NumPy was used without BLAS or LAPACK libraries. icc version 17.0.3.191 was used. The LibriSpeech dataset (100 hours) was used.
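
For reference, here is a minimal sketch of applying the thread-affinity settings above before launching a run; the training script name and flags are hypothetical placeholders rather than the exact entry point of the deepspeech repository:

import os
import subprocess

# Apply the OpenMP settings documented above for the Intel MKL backend on
# the 2S Xeon Platinum 8180 system, then launch training.
env = dict(os.environ,
           KMP_AFFINITY="compact,0,1,verbose",
           OMP_NUM_THREADS="56")

# Hypothetical invocation: the script name and flags are placeholders.
subprocess.run(["python", "train.py", "-b", "mkl", "-z", "32"], env=env, check=True)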

2 Configuration details: Same as above, but neon™ 2.2.0 (https://github.com/NervanaSystems/neon/commits/v2.2.0) commit id 5843e7116d880dfc59c8fb558beb58dd2ef421d0 was used.

3 Configuration details: 2S Intel® Xeon® Platinum 8168 CPU @ 2.7GHz (24 cores), HT enabled, turbo on, 192GB ECC RAM, OS: Ubuntu 14.04 (Trusty), Intel SSD 800GB. neon™ 2.3.0 (https://github.com/NervanaSystems/neon/commits/v2.3.0) commit id 9eb09d79cb65c4854db9f5d0140ad7a8e247359e and DS2 (https://github.com/NervanaSystems/deepspeech) commit id e55159d56569ff16c75369141a42fd5589d83279 were used. An AlexNet batch size of 256 and a VGG-16 batch size of 64 were used. For the Intel® MKL backend, the KMP_AFFINITY and OMP_NUM_THREADS settings were “export KMP_AFFINITY=compact,0,1,granularity=fine” and “export OMP_NUM_THREADS=48”. For the NumPy backend, NumPy was used without BLAS or LAPACK libraries. The ImageNet dataset (1,000 classes, 1.28 million images) was used.

4 The performance claims are based on the following data points:

Model / neon version and performance result:
DS2, neon v2.3.0, NumPy CPU backend (batch size 32, sample length 30 seconds): 140 seconds/batch
DS2, neon v2.3.0, Intel® MKL backend (same as above): 20.5 seconds/batch
DS2, neon v2.2.0, NumPy CPU backend (same as above): 153 seconds/batch
DS2, neon v2.2.0, Intel® MKL backend (same as above): 50.4 seconds/batch
VGG-16, neon v2.3.0, Intel® MKL backend (batch size 64): 41.5 images/sec
VGG-16, neon v2.2.0, Intel® MKL backend (batch size 64): 14.1 images/sec
AlexNet, neon v2.3.0, Intel® MKL backend (batch size 256): 550 images/sec
AlexNet, neon v2.2.0, Intel® MKL backend (batch size 256): 250 images/sec

LEGAL Notices and Disclaimers

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase.  For more complete information about performance and benchmark results, visit www.intel.com/benchmarks

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice Revision #20110804

No computer system can be absolutely secure.

Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.

Intel processors of the same SKU may vary in frequency or power as a result of natural variability in the production process.

© 2017 Intel Corporation. Intel, the Intel logo, Xeon, Xeon logos, and neon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

