Accelerating Deep Learning Training and Inference with System Level Optimizations
Jan 24, 2018
Training deep Convolutional Neural Networks (CNNs) is a demanding undertaking. Popular CNN topologies such as ResNet-50*, GoogLeNet-v1*, Inception-v3*, and others require the execution of hundreds of compute-intensive functions in each of hundreds of thousands of iterations.
Intel’s optimizations for popular deep learning frameworks have significantly increased processor-level performance, but there is even more we can do. In particular, system level optimizations can greatly increase the performance of CNN workloads on Intel® Xeon® and Intel® Xeon Phi™ processors used in deep learning and high performance computing applications.
These optimizations and Best Known Methods (BKMs) are the subject of our new white paper, “Boosting Deep Learning Training & Inference Performance on Intel Xeon and Xeon Phi Processor Based Platforms.” This paper was co-authored by Deepthi Karkada, Vamsi Sripathi, Dr. Kushal Datta, and Ananth Sankaranarayanan from Intel’s Artificial Intelligence Products Group. In it, we show that, without changing a single line of framework code, we can boost the performance of deep learning training and inference by up to 2x beyond the current software optimizations available for open source TensorFlow* and Caffe*.
Current deep learning frameworks like TensorFlow* and Caffe* do not take full advantage of CPU cores during the execution of CNNs. This is because the user-controllable parameters do not provide sufficient micro-architectural information on the underlying NUMA configuration to achieve optimal performance on multi-socket Intel Xeon processor-based platforms. Without the knowledge of CPU socket and NUMA configuration, simple thread affinitization (as in the case of thread pool) does not lead to optimal performance. System level optimizations are necessary to achieve the best performance for CNN workloads on CPU-based platforms.
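To make the missing NUMA information concrete, here is a minimal sketch of how a launcher script could discover which logical CPUs belong to which NUMA node on Linux. It is an illustration only, not code from the white paper. It assumes the kernel exposes the standard sysfs files /sys/devices/system/node/nodeN/cpulist; the function names parse_cpulist and read_numa_topology are hypothetical.

```python
import glob
import os
from typing import Dict, List


def parse_cpulist(text: str) -> List[int]:
    """Expand a kernel cpulist string such as "0-23,48-71" into CPU ids."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or stray comma
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_topology(root: str = "/sys/devices/system/node") -> Dict[int, List[int]]:
    """Map each NUMA node id to its logical CPUs (assumes Linux sysfs layout)."""
    topology: Dict[int, List[int]] = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*", "cpulist")):
        node_dir = os.path.basename(os.path.dirname(path))  # e.g. "node0"
        with open(path) as handle:
            topology[int(node_dir[len("node"):])] = parse_cpulist(handle.read())
    return topology
```

On a system without these sysfs files, read_numa_topology returns an empty dictionary, so a caller can fall back to treating the machine as a single node.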
As laid out in our BKMs, we were able to improve core utilization by partitioning the sockets and cores on the platform into separate computing devices and using these partitions to run several deep learning training instances concurrently. These instances work synchronously in tandem, each on a local batch of input data. Each instance is a process bound to a subset of the total cores in the system using core affinity settings. We found that, using core affinity and memory locality optimizations, we were able to improve performance by up to 2x on a single node with four workers per node, relative to the current optimization using TensorFlow 1.4.0. Figure 1 shows the deep learning training performance improvements realized by using these optimizations with six deep learning benchmark topologies.
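The partitioning described above can be sketched in a few lines of Python. This is a hedged illustration, not the scripts from the white paper. It assumes physical core ids are numbered contiguously per socket (socket 0 owns cores 0 to 23, socket 1 owns cores 24 to 47), which should be verified with a tool such as lscpu on a real system. The names WorkerPlacement, partition_cores, and numactl_prefix are hypothetical. The numactl options --physcpubind and --membind are used to express core affinity and memory locality.

```python
from typing import List, NamedTuple


class WorkerPlacement(NamedTuple):
    worker: int       # worker index on this node
    numa_node: int    # socket / NUMA node the worker is bound to
    cores: List[int]  # core ids reserved for the worker


def partition_cores(sockets: int, cores_per_socket: int, workers: int) -> List[WorkerPlacement]:
    """Give each worker an equal, non-overlapping slice of one socket's cores.

    Assumption: core ids are contiguous per socket.
    """
    if workers % sockets != 0:
        raise ValueError("workers must divide evenly across sockets")
    workers_per_socket = workers // sockets
    if cores_per_socket % workers_per_socket != 0:
        raise ValueError("cores per socket must divide evenly across its workers")
    width = cores_per_socket // workers_per_socket
    placements = []
    for worker in range(workers):
        socket = worker // workers_per_socket
        slot = worker % workers_per_socket
        start = socket * cores_per_socket + slot * width
        placements.append(WorkerPlacement(worker, socket, list(range(start, start + width))))
    return placements


def numactl_prefix(placement: WorkerPlacement) -> List[str]:
    """Build a numactl command prefix that pins cores and local memory."""
    cpu_list = ",".join(str(core) for core in placement.cores)
    return ["numactl", f"--physcpubind={cpu_list}", f"--membind={placement.numa_node}"]
```

For the two-socket, 24-cores-per-socket platform used in Figure 1, partition_cores(2, 24, 4) yields four workers of 12 cores each, two per socket. The returned prefix would be prepended to whatever command starts a training worker.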
Intel® Xeon® Platinum 8168 Processor: TensorFlow* Multi-Node & Multiple Workers/Socket
Training: TensorFlow 1.4, BS=64, Image Dataset, grpc/10Gbit Ethernet, Parameter Server on each node
Figure 1: TensorFlow 1.4 Training Performance (Projected Time-To-Train (TTT)) Improvement with optimized affinity for cores and memory locality using 4 Workers/Node compared to current baseline with 1 Worker/Node. Platform Configuration: 2S Intel Xeon Platinum 8168 Processor @ 2.70GHz (24 cores), HT enabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 192GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel SSD DC S3700 Series. Multiple nodes connected with 10Gbit Ethernet. TensorFlow 1.4.0, GCC 6.2.0, Intel MKL-DNN. Training measured with image data on SSD. Source: Intel internal testing, December 2017.
This same approach can be applied to deep learning inference. We created multiple independent deep learning inference framework instances and bound each instance to its own partition of cores, with memory allocated locally, on a single- or multi-socket system. We found that we were able to boost deep learning inference performance by up to 2.7x with our system level optimizations, relative to the current optimization using TensorFlow 1.4. Figure 2 shows deep learning inference performance improvements found by using these system level optimizations with five deep learning benchmark topologies.
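As a rough sketch of the multi-stream idea, the following Python launches several independent processes and pins each one to its own group of CPUs. It is an illustration under stated assumptions, not the method from the white paper: it relies on the Linux-only os.sched_setaffinity call, it sets OMP_NUM_THREADS on the assumption that the framework's math library honors that variable, and it does not bind memory (a tool such as numactl would be needed for that). The names split_cpus and launch_streams are hypothetical.

```python
import os
import subprocess
from typing import Iterable, List, Sequence


def split_cpus(cpus: Iterable[int], streams: int) -> List[List[int]]:
    """Split the CPU ids into `streams` equal, contiguous, disjoint groups."""
    ordered = sorted(cpus)
    if streams <= 0 or not ordered or len(ordered) % streams != 0:
        raise ValueError("CPU count must divide evenly across streams")
    width = len(ordered) // streams
    return [ordered[i * width:(i + 1) * width] for i in range(streams)]


def launch_streams(command: Sequence[str], cpus: Iterable[int], streams: int) -> List[subprocess.Popen]:
    """Start one copy of `command` per stream, each pinned to its CPU group."""
    processes = []
    for group in split_cpus(cpus, streams):
        # Assumption: the inference process sizes its thread pool from OMP_NUM_THREADS.
        env = dict(os.environ, OMP_NUM_THREADS=str(len(group)))

        def pin(group: List[int] = group) -> None:
            # Runs in the child before exec; pid 0 means "this process". Linux only.
            os.sched_setaffinity(0, group)

        processes.append(subprocess.Popen(list(command), env=env, preexec_fn=pin))
    return processes
```

Calling launch_streams with the 48 physical core ids and streams set to 2, 4, or 8 would mirror the stream counts compared in Figure 2; the caller is responsible for waiting on the returned processes and summing their throughput.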
Intel® Xeon® Platinum 8168 Processor: TensorFlow* Single-Node & Multiple Workers Inference
TensorFlow 1.4, Image Dataset, forward_only
Figure 2: TensorFlow Inference Performance (Images/Sec) Improvement with optimized affinity for cores and memory locality using 2, 4, and 8 concurrent Streams compared to current baseline with equivalent batch size using 1 Stream. Platform Configuration: 2S Intel Xeon Platinum 8168 processor @ 2.70GHz (24 cores), HT enabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 192GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel SSD DC S3700 Series. TensorFlow 1.4.0, GCC 6.2.0, Intel MKL-DNN. Inference measured with --forward_only. Image data on SSD. Source: Intel internal testing, December 2017.
Please consult our white paper, “Boosting Deep Learning Training & Inference Performance on Intel Xeon and Xeon Phi Processor Based Platforms,” for more information on these system level optimizations as well as code that will let you try them yourself.
Intel is committed to delivering excellent deep learning performance on the same well-known, versatile, Intel® architecture based systems that are being used for workloads like big data analytics, simulation, and modeling. We are excited to continue to improve Intel’s AI hardware and software solution portfolio to best deliver AI’s potential to all.
Notices and Disclaimers:
Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as “Spectre” and “Meltdown”. Implementation of these updates may make these results inapplicable to your device or system.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804
Intel processors of the same SKU may vary in frequency or power as a result of natural variability in the production process.
© 2018 Intel Corporation. Intel, the Intel logo, Xeon, Xeon logos, and neon, are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.