neon™ 2.6.0: Inference Optimizations for Single Shot MultiBox Detector on Intel® Xeon® Processor Architectures
Jan 24, 2018
We are excited to release the neon™ 2.6.0 framework, which features improvements to the CPU inference path for a VGG-16 based Single Shot MultiBox Detector (SSD) neural network. These updates, along with the training optimizations released in neon 2.5.0, give neon significant boosts in both training and inference performance. (Granular configuration details, as well as the raw data used in testing these configurations, can be found at the end of the blog.)
Our focus was on a well-known hotspot (dilated convolution) in the SSD model. To accelerate convolution layers with dilation, the vanilla Python implementation was replaced with C kernels for matrix multiplication. The detection-output layer in the inference path was also optimized to make better use of these kernels.
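To illustrate why dilated convolution reduces to matrix multiplication, here is a minimal NumPy sketch of the im2col-plus-GEMM formulation. This is our own reconstruction for clarity, not neon's actual code (which performs this step in optimized C kernels); the function name `dilated_conv2d` and its layout choices are assumptions.

```python
import numpy as np

def dilated_conv2d(x, w, dilation=1):
    """Dilated 2-D convolution via im2col + a single matrix multiply.

    x: input feature map, shape (C_in, H, W)
    w: filters, shape (C_out, C_in, K, K)
    Stride 1, no padding. Returns (C_out, H_out, W_out).
    """
    c_in, h, width = x.shape
    c_out, _, k, _ = w.shape
    eff_k = (k - 1) * dilation + 1           # effective receptive field
    h_out, w_out = h - eff_k + 1, width - eff_k + 1

    # im2col: gather dilated patches into the rows of a 2-D matrix.
    cols = np.empty((c_in * k * k, h_out * w_out))
    idx = 0
    for i in range(k):
        for j in range(k):
            patch = x[:, i * dilation:i * dilation + h_out,
                         j * dilation:j * dilation + w_out]
            cols[idx * c_in:(idx + 1) * c_in] = patch.reshape(c_in, -1)
            idx += 1

    # One GEMM: (C_out, C_in*K*K) @ (C_in*K*K, H_out*W_out).
    # Reorder weights to (co, i, j, ci) so rows of `cols` line up.
    out = w.transpose(0, 2, 3, 1).reshape(c_out, -1) @ cols
    return out.reshape(c_out, h_out, w_out)
```

Once the work is expressed as one large matrix multiply, it can be handed to a highly tuned GEMM implementation instead of nested Python loops, which is the essence of the optimization described above.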
Altogether, these improvements yielded the results shown in Figure 1 below: up to 3.7X speedup for a batch size of 8, and up to 2.4X for a batch size of 32 on an Intel® Xeon® Processor E5-2699 v4-based two-socket system. The improvements were even more significant when running SSD on an Intel® Xeon® Platinum 8180 Processor-based two-socket system: up to 8.4X for batch size 8, and up to 4.3X for batch size 32.
Figure 1: Training performance comparison of SSD with neon v2.4.0 and neon v2.6.0 on two Intel architectures with two batch sizes (performance obtained using data layer). Results based on internal Intel testing: December 2017. Configuration details in endnotes.
Table 1: Training performance (images/sec) of SSD on IA
The inference optimizations were just as impressive: up to 4.4X speedup with a batch size of 44 on the 2S Intel Xeon E5-2699 v4 system, and up to 7.8X with a batch size of 56 on the Intel Xeon Platinum 8180 processor-based system, as shown in Figure 2.
Figure 2: Inference performance comparison of SSD with neon v2.4.0 and neon v2.6.0 on two Intel architectures (performance measured with data layer). Results based on internal Intel testing, January 2018. Configuration details in endnotes.
Table 2: Inference performance (images/sec) of SSD on IA
Finally, and perhaps most impressively, we observed up to a 150X inference performance boost over the vanilla Python (NumPy) CPU version of SSD on the Intel Xeon Platinum 8180 processor-based system, as shown in Table 2 above.
To conclude, we have significantly improved SSD model performance on Intel Xeon processor-based system architectures. Please stay tuned for more optimized neon models on Intel architecture.
Table 1/Figure 1: 2S Intel® Xeon® Platinum 8180 CPU @ 2.5GHz (28 cores per socket).
OMP_NUM_THREADS was set to 56. neon 2.6.0 was tested with "export KMP_AFFINITY=verbose,granularity=fine,proclist=[0-55],explicit". neon 2.4.0 was tested without setting KMP_AFFINITY.
Table 2/Figure 2: 2S Intel® Xeon® E5-2699 v4 CPU @ 2.2GHz (22 cores per socket).
OMP_NUM_THREADS was set to 44. All tests were set with "…
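For readers who want to reproduce the thread-pinning setup described in these endnotes, the settings quoted for the 2S Xeon Platinum 8180 system can be expressed as environment variables as below. The commented-out line for the E5-2699 v4 system is our extrapolation only; the post does not give its exact KMP_AFFINITY string.

```shell
# Settings as quoted for the 2S Intel Xeon Platinum 8180 (2 x 28 cores):
export OMP_NUM_THREADS=56
export KMP_AFFINITY=verbose,granularity=fine,proclist=[0-55],explicit

# For the 2S Intel Xeon E5-2699 v4 (2 x 22 cores) the post sets
# OMP_NUM_THREADS=44; an analogous pin list would be (assumption):
# export KMP_AFFINITY=verbose,granularity=fine,proclist=[0-43],explicit
```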
Performance was measured using the Pascal VOC dataset with images of size 300 by 300.
Intel® MKL-DNN library (mklml_lnx_2018.0.1.20171227 version)
neon 2.6.0 SSD training was tested (neon master branch) with
neon 2.6.0 SSD inference was tested (neon master branch) with
neon 2.4.0 SSD training and inference were tested (neon master branch) with
Notices and Disclaimers:
Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as “Spectre” and “Meltdown”. Implementation of these updates may make these results inapplicable to your device or system.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804
Intel processors of the same SKU may vary in frequency or power as a result of natural variability in the production process.
© 2018 Intel Corporation. Intel, the Intel logo, Xeon, Xeon logos, and neon, are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.