neon v2.3.0: Significant Performance Boost for Deep Speech 2 and VGG Models

We are excited to announce the release of neon™ 2.3.0. It ships with significant performance improvements for Deep Speech 2 (DS2) and VGG models running on Intel® architecture (IA). For the DS2 model, our tests show up to a 6.8X improvement[1,4] with the Intel® Math Kernel Library (Intel® MKL) backend over the NumPy CPU backend with neon™ 2.3.0, and more than a 2X improvement[2,4] over neon™ 2.2.0. For the VGG-16 model, we observed up to a 2.8X[3,4] performance improvement over neon™ 2.2.0.

To improve DS2 performance, we added data layout optimizations that make memory access patterns more compatible with Intel® MKL (version mklml_lnx_2018.0.20170908). We also added optimized C/OpenMP* kernels to replace vanilla Python implementations. Together, these optimizations boosted neon™ 2.3.0 DS2 performance on IA by as much as 6.8X compared to the performance achieved by neon's NumPy CPU backend (see Figure 1).
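
If you want to reproduce this kind of comparison, the backend is selected when you construct it. Below is a minimal sketch, assuming a neon 2.3.0 installation; the matrix sizes, iteration count, and timing loop are ours and purely illustrative, standing in for real DS2 training work.

    # Time a dense operation on neon's Intel MKL backend; swap
    # backend='mkl' for backend='cpu' to get the NumPy CPU baseline.
    import numpy as np
    from timeit import default_timer as timer

    from neon.backends import gen_backend

    be = gen_backend(backend='mkl', batch_size=32)

    # Allocate tensors on the chosen backend (sizes are illustrative).
    x = be.array(np.random.rand(1024, 1024))
    y = be.array(np.random.rand(1024, 1024))
    out = be.empty((1024, 1024))

    start = timer()
    for _ in range(10):
        be.compound_dot(x, y, out)  # out = x . y, computed by the backend
    print("10 matrix multiplies took %.3f seconds" % (timer() - start))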


Figure 1: Training performance of Deep Speech 2 with neon v2.2.0 and v2.3.0, comparing the NumPy CPU backend and the faster Intel® MKL backend, on a dual-socket Intel® Xeon® Platinum 8180 platform.

For the VGG-16 and AlexNet models, we improved training throughput by more than 2X. We added an optimization that fuses the convolution and bias layers in neon to minimize (or eliminate) costly tensor layout conversions between layers. This benefits models such as VGG-16 and AlexNet, both of which are built primarily from convolutional layers with bias. By fusing the two layers, we maximized the time the code spends executing in accelerated Intel® MKL mode, minimizing the overall time it takes to run the models. As a result, performance improved by up to 2.1X[4] for AlexNet and up to 2.8X[4] for VGG-16. Figure 2 shows the training performance of AlexNet and VGG-16 with the Intel® MKL backend of neon v2.2.0 and v2.3.0 on a dual-socket Intel® Xeon® Platinum 8168 platform.
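
For illustration, here is how a VGG-style convolution-with-bias stack looks in user code. This is a minimal sketch assuming neon 2.3.0, with layer shapes loosely following the first VGG-16 block; in neon, Conv is a compound layer (convolution + bias + activation), so the fusion happens inside the MKL backend with no change to model definitions like this one.

    # A VGG-style conv+bias stack; the v2.3.0 MKL backend fuses the
    # convolution and bias steps so tensors can stay in MKL's layout.
    from neon.backends import gen_backend
    from neon.initializers import Constant, Gaussian
    from neon.layers import Conv, Pooling
    from neon.transforms import Rectlin

    be = gen_backend(backend='mkl', batch_size=64)

    init = Gaussian(scale=0.01)
    layers = [
        Conv((3, 3, 64), init=init, bias=Constant(0), activation=Rectlin()),
        Conv((3, 3, 64), init=init, bias=Constant(0), activation=Rectlin()),
        Pooling(2, strides=2),
    ]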

Figure 2: Training performance of AlexNet and VGG-16 using the Intel® MKL backend of neon v2.2.0 and v2.3.0 on a dual-socket Intel® Xeon® Platinum 8168 platform. The ImageNet dataset (1,000 classes) was used.

We’re continuing to improve the performance of additional models, especially SSD and GAN models. Stay tuned for further neon performance improvements on Intel architecture!

The authors would like to acknowledge Hanlin Tang’s help in realizing the fusion of convolution and bias across all three backends in neon, Baojun Liu’s validation effort, and Wadim Sokolowski’s help in generating the performance data for AlexNet and VGG-16.


[1] Configuration details: 2S Intel® Xeon® Platinum 8180 CPU @ 2.5GHz (28 cores), HT enabled, turbo on, 192GB ECC RAM, CentOS Linux release 7.3, Linux kernel 3.10.0-514.el7.x86_64, Intel SSD 400GB. neon™ 2.3.0 (https://github.com/NervanaSystems/neon/commits/v2.3.0, commit id 9eb09d79cb65c4854db9f5d0140ad7a8e247359e) and DS2 (https://github.com/NervanaSystems/deepspeech, commit id e55159d56569ff16c75369141a42fd5589d83279) were used. A DS2 batch size of 32 and a speech sample length of 30 seconds were used. For the Intel® MKL backend, the KMP_AFFINITY and OMP_NUM_THREADS settings were “export KMP_AFFINITY=compact,0,1,verbose” and “export OMP_NUM_THREADS=56”. For the NumPy backend, NumPy was used without BLAS or LAPACK libraries. icc version 17.0.3.191 was used. The LibriSpeech dataset (100 hours) was used.

[2] Configuration details: Same as above, but neon™ 2.2.0 (https://github.com/NervanaSystems/neon/commits/v2.2.0, commit id 5843e7116d880dfc59c8fb558beb58dd2ef421d0) was used.

[3] Configuration details: 2S Intel® Xeon® Platinum 8168 CPU @ 2.7GHz (24 cores), HT enabled, turbo on, 192GB ECC RAM, OS: Ubuntu 14.04 (Trusty), Intel SSD 800GB. neon™ 2.3.0 (https://github.com/NervanaSystems/neon/commits/v2.3.0, commit id 9eb09d79cb65c4854db9f5d0140ad7a8e247359e) and DS2 (https://github.com/NervanaSystems/deepspeech, commit id e55159d56569ff16c75369141a42fd5589d83279) were used. An AlexNet batch size of 256 and a VGG-16 batch size of 64 were used. For the Intel® MKL backend, the KMP_AFFINITY and OMP_NUM_THREADS settings were “export KMP_AFFINITY=compact,0,1,granularity=fine” and “export OMP_NUM_THREADS=48”. For the NumPy backend, NumPy was used without BLAS or LAPACK libraries. The ImageNet dataset (1,000 classes, 1.28 million images) was used.

[4] The performance claims are based on the following data points:

Model / neon version                                                          | Performance result
DS2, neon v2.3.0, NumPy CPU backend (batch size 32, sample length 30 seconds) | 140 seconds/batch
DS2, neon v2.3.0, Intel® MKL backend (same as above)                          | 20.5 seconds/batch
DS2, neon v2.2.0, NumPy CPU backend (same as above)                           | 153 seconds/batch
DS2, neon v2.2.0, Intel® MKL backend (same as above)                          | 50.4 seconds/batch
VGG-16, neon v2.3.0, Intel® MKL backend (batch size 64)                       | 41.5 images/sec
VGG-16, neon v2.2.0, Intel® MKL backend (batch size 64)                       | 14.1 images/sec
AlexNet, neon v2.3.0, Intel® MKL backend (batch size 256)                     | 550 images/sec
AlexNet, neon v2.2.0, Intel® MKL backend (batch size 256)                     | 250 images/sec
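
As a quick sanity check, the headline speedups can be derived from these data points; a minimal sketch follows (variable names are ours). The last two ratios come out slightly above the quoted 2.8X and 2.1X because the published figures are rounded.

    # Derive the headline speedups from the data points above. DS2 numbers
    # are seconds/batch (lower is better); VGG-16 and AlexNet numbers are
    # images/sec (higher is better).
    ds2_v230_numpy = 140.0   # seconds/batch, neon v2.3.0, NumPy CPU backend
    ds2_v230_mkl = 20.5      # seconds/batch, neon v2.3.0, Intel MKL backend
    ds2_v220_mkl = 50.4      # seconds/batch, neon v2.2.0, Intel MKL backend
    vgg_v230, vgg_v220 = 41.5, 14.1      # images/sec
    alex_v230, alex_v220 = 550.0, 250.0  # images/sec

    print("DS2, MKL vs NumPy CPU (v2.3.0): %.1fX" % (ds2_v230_numpy / ds2_v230_mkl))  # ~6.8X
    print("DS2, v2.3.0 vs v2.2.0 (MKL):    %.1fX" % (ds2_v220_mkl / ds2_v230_mkl))    # ~2.5X, i.e. more than 2X
    print("VGG-16, v2.3.0 vs v2.2.0:       %.1fX" % (vgg_v230 / vgg_v220))            # ~2.9X
    print("AlexNet, v2.3.0 vs v2.2.0:      %.1fX" % (alex_v230 / alex_v220))          # 2.2X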

Legal Notices and Disclaimers

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice Revision #20110804

No computer system can be absolutely secure.

Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.

Intel processors of the same SKU may vary in frequency or power as a result of natural variability in the production process.

© 2017 Intel Corporation. Intel, the Intel logo, Xeon, Xeon logos, and neon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.