neon™ 2.6.0: Inference Optimizations for Single Shot MultiBox Detector on Intel® Xeon® Processor Architectures

We are excited to release the neon™ 2.6.0 framework, which features improvements to the CPU inference path on a VGG-16 based Single Shot MultiBox Detector (SSD) neural network. These updates, along with the training optimizations released in neon 2.5.0, show that neon is gaining significant boosts in both training and inference performance. (Granular configuration details, as well as the raw data used in testing these configurations, can be found at the end of this blog.)

Our focus was on a well-known hotspot in the SSD model: dilated convolution. To accelerate convolution layers with dilation, the vanilla Python implementation was replaced with C kernels for matrix multiplication. The detection-output layer in the inference path was also optimized to make better use of these kernels.
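To illustrate why a dilated convolution maps naturally onto matrix multiplication, here is a minimal NumPy sketch. It is not neon's implementation: the function names, the (C, H, W) layout, and the stride-1, no-padding restriction are assumptions made for illustration. The first function gathers the dilated filter taps into a matrix (an im2col step) and then performs a single matrix multiply; the second is a loop-based reference used only to check the result.

```python
import numpy as np


def dilated_conv2d_gemm(x, w, dilation=1):
    """Dilated 2D convolution (cross-correlation) as one matrix multiply.

    x: input of shape (C, H, W); w: filters of shape (K, C, R, S).
    Stride 1 and no padding are assumed to keep the sketch short.
    """
    C, H, W = x.shape
    K, Cw, R, S = w.shape
    assert C == Cw, "input and filter channel counts must match"

    # A dilated R x S filter spans (R-1)*dilation+1 by (S-1)*dilation+1 pixels.
    out_h = H - (R - 1) * dilation
    out_w = W - (S - 1) * dilation

    # im2col: one row per (channel, filter tap), one column per output pixel.
    cols = np.empty((C * R * S, out_h * out_w), dtype=x.dtype)
    row = 0
    for c in range(C):
        for r in range(R):
            for s in range(S):
                top, left = r * dilation, s * dilation
                cols[row] = x[c, top:top + out_h, left:left + out_w].reshape(-1)
                row += 1

    # The whole convolution is now a (K, C*R*S) x (C*R*S, out_h*out_w) product.
    out = w.reshape(K, -1) @ cols
    return out.reshape(K, out_h, out_w)


def dilated_conv2d_naive(x, w, dilation=1):
    """Loop-based reference, evaluated one output pixel at a time."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    eff_r = (R - 1) * dilation + 1
    eff_s = (S - 1) * dilation + 1
    out = np.zeros((K, H - eff_r + 1, W - eff_s + 1), dtype=x.dtype)
    for k in range(K):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # Strided slicing picks out the R x S dilated taps.
                patch = x[:, i:i + eff_r:dilation, j:j + eff_s:dilation]
                out[k, i, j] = np.sum(w[k] * patch)
    return out
```

Once the taps are laid out this way, almost all of the arithmetic sits in a single matrix product, which is the kind of operation an optimized kernel can speed up.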

Altogether, these improvements yielded the training results shown in Figure 1 below: up to 3.7X speedup for a batch size of 8, and up to 2.4X for a batch size of 32 on an Intel® Xeon® Processor E5-2699 v4-based two-socket system. The improvements were even more significant when running SSD on an Intel® Xeon® Platinum 8180 Processor-based two-socket system: up to 8.4X for a batch size of 8, and up to 4.3X for a batch size of 32.

Figure 1: Training performance comparison of SSD with neon v2.4.0 and neon v2.6.0 on two Intel architectures with two batch sizes (performance obtained using data layer). Results based on internal Intel testing: December 2017. Configuration details in endnotes.

 

Table 1: Training performance (images/sec) of SSD on IA

For inference performance, the gains were equally impressive: up to 4.4X speedup with a batch size of 44 on the 2S Intel Xeon E5-2699 v4 system, and up to 7.8X with a batch size of 56 on the Intel Xeon Platinum 8180 processor-based system, as shown in Figure 2.

Figure 2: Inference performance comparison of SSD with neon v2.4.0 and neon v2.6.0 on two Intel architectures (performance measured with data layer). Results based on internal Intel testing, January 2018. Configuration details in endnotes.

 

Table 2: Inference performance (images/sec) of SSD on IA

Finally, and perhaps most impressively, we observed up to a 150X inference performance boost over the vanilla Python NumPy CPU version of SSD on the Intel Xeon Platinum 8180 processor-based system, as shown in Table 2 above.
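The tables report throughput in images/sec, and each speedup is the ratio of two such throughputs. The sketch below shows one way such figures could be computed. It is not the harness used for the results in this blog: `images_per_second`, `speedup`, and the `run_batch` callable are hypothetical names, and the warm-up and iteration counts are arbitrary.

```python
import time


def images_per_second(run_batch, batch_size, iterations=10, warmup=2):
    """Time `iterations` calls of run_batch() and return images/sec.

    run_batch is a hypothetical zero-argument callable that processes
    one batch of `batch_size` images (for example, one forward pass).
    """
    # Warm-up calls are excluded so one-time setup costs do not skew timing.
    for _ in range(warmup):
        run_batch()

    start = time.perf_counter()
    for _ in range(iterations):
        run_batch()
    elapsed = time.perf_counter() - start

    return batch_size * iterations / elapsed


def speedup(new_ips, baseline_ips):
    """Ratio of optimized throughput to baseline throughput."""
    return new_ips / baseline_ips
```

A speedup computed this way is meaningful only when both runs use the same batch size, dataset, and hardware.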

To conclude, we have significantly improved SSD model performance on Intel Xeon processor-based system architectures. Please stay tuned for more optimized neon models on Intel architecture.

 

Configuration details:  

Table 1/Figure 1: 2S Intel® Xeon® Platinum 8180 CPU@2.5GHz (28 cores). OMP_NUM_THREADS was set to 56. neon 2.6.0 was tested with “export KMP_AFFINITY=verbose,granularity=fine,proclist=[0-55],explicit”. neon 2.4.0 was tested without setting KMP_AFFINITY.

Table 2/Figure 2: 2S Intel® Xeon® E5-2699 v4 CPU@2.2GHz (22 cores). OMP_NUM_THREADS was set to 44. All tests were run with “export KMP_AFFINITY=compact,granularity=fine”.

Performance was measured using the Pascal VOC dataset with images of size 300 by 300.

The Intel® MKL-DNN library (mklml_lnx_2018.0.1.20171227 version) was used.
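For readers who launch their runs from Python rather than a shell, the thread settings quoted above could be applied as in the following sketch. The helper names and dictionary names are hypothetical, and the affinity strings are written without embedded spaces. The variables are assumed to need setting before any library that loads the OpenMP runtime is imported, because such settings are typically read when the runtime initializes.

```python
import os


def thread_env(num_threads, affinity=None):
    """Build the environment settings for one test configuration."""
    settings = {"OMP_NUM_THREADS": str(num_threads)}
    if affinity is not None:
        settings["KMP_AFFINITY"] = affinity
    return settings


# Values taken from the configuration notes above.
XEON_E5_2699_V4 = thread_env(44, "compact,granularity=fine")
XEON_PLATINUM_8180_NEON_260 = thread_env(
    56, "verbose,granularity=fine,proclist=[0-55],explicit"
)
# neon 2.4.0 on the Platinum 8180 was tested without KMP_AFFINITY.
XEON_PLATINUM_8180_NEON_240 = thread_env(56)


def apply_env(settings):
    """Export the settings into this process's environment.

    Call this before importing the framework so the OpenMP runtime
    sees the values when it starts (an assumption of this sketch).
    """
    os.environ.update(settings)
```

Calling `apply_env(XEON_E5_2699_V4)` at the top of a script has the same effect as the corresponding `export` lines, but only for that process and its children.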

neon 2.6.0 SSD training was tested (neon master branch) with
commit d1478b7cf582bea62635e0d98c57d24585a49899

neon 2.6.0 SSD inference was tested (neon master branch) with
commit f5b612988ac93c52244cf9ab3e6ac31b47c215c7

neon 2.4.0 SSD training and inference were tested (neon master branch) with
commit ae4e9d59f8e85c5fa368ddec97bc53dfffdab75d

Notices and Disclaimers:

Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as “Spectre” and “Meltdown”.  Implementation of these updates may make these results inapplicable to your device or system.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com.

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804
Intel processors of the same SKU may vary in frequency or power as a result of natural variability in the production process.
© 2018 Intel Corporation. Intel, the Intel logo, Xeon, Xeon logos, and neon, are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.