Optimizing Taboola TensorFlow* Serving Application on Intel® Xeon® Scalable Processors

Publishers, marketers, and advertising agencies are increasingly using artificial intelligence applications via software-as-a-service (SaaS) cloud platforms. Taboola, an Intel® AI Builders member, provides its customers with custom inferencing solutions built on the TensorFlow Serving (TFS) [1] framework.

Intel and Taboola have collaborated to optimize and significantly speed up Taboola’s custom TensorFlow Serving application with the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) on Intel Xeon Scalable processors.

TFS is an open-source deployment service for machine learning models in production environments. It is built on top of the TensorFlow deep learning framework and follows a client-server workflow: the server hosts a pre-trained model, and client machines send prediction requests to it over gRPC. When the server receives a request, it runs a forward pass through the pre-trained model and returns the result.

Figure 1: TensorFlow-Serving workflow. Image courtesy https://www.tensorflow.org/serving/

In order to measure TFS performance consistently, we set up a benchmark workflow whereby 10 clients each sent 10,000 inference requests to a 2-socket system featuring Intel® Xeon® Platinum 8180 processors, and the number of recommendation requests served by the TFS server was used as the performance metric. The application was benchmarked in the following two configurations:

  • Baseline TFS: uses the Eigen C++ template library [2] for tensor/matrix computations
  • Optimized TFS: uses Intel® MKL-DNN [3] to accelerate commonly used DNN primitives; operations not available in Intel MKL-DNN fall back to Eigen

When measuring performance in these two configurations, we observed that the optimized TFS delivers a 1.15x speed-up over the baseline. This improvement comes from the acceleration Intel MKL-DNN provides for the matrix-matrix multiplication (SGEMM) operations encountered in the application. To use all 56 cores of the 2-socket system effectively, we ran two optimized TFS instances, pinning each instance’s threads and memory allocations to its own CPU socket and NUMA domain. This technique brought the overall speed-up to 1.3x over the baseline TFS instance. Figure 3 at the end of this post shows the performance in each of these configurations.
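One common way to realize this per-socket pinning is with the `numactl` utility. The sketch below is illustrative only: the binary name, ports, and model path are placeholders, and the node numbering assumes one NUMA domain per socket.

```shell
# Launch one TFS instance per socket, binding both CPU threads and
# memory allocations to that socket's NUMA node.
# (Ports, binary name, and model path are placeholders.)
numactl --cpunodebind=0 --membind=0 \
    tensorflow_model_server --port=8500 --model_base_path=/models/example &
numactl --cpunodebind=1 --membind=1 \
    tensorflow_model_server --port=8501 --model_base_path=/models/example &
```

Binding memory as well as threads (`--membind`) matters here: it keeps each instance’s tensors in the DRAM attached to its own socket, avoiding slower cross-socket memory traffic.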

To understand the remaining performance bottlenecks, we analyzed the application with the Intel® VTune™ tool [4] and observed that tensor broadcast operations (a tensor is an n-dimensional array; a broadcast operation replicates the input tensor by a specified factor along a given dimension) were the most time-consuming operations in the workflow.
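As a concrete illustration of the operation (using NumPy as a stand-in for Eigen), broadcasting a tensor with a unit dimension by a factor b replicates each element b times along that axis:

```python
import numpy as np

# Broadcast (tile) a 2x2x1 tensor by a factor of 3 on its unit dimension.
# Each input element is replicated 3 times along the last axis, matching
# the semantics of a tensor broadcast operation.
x = np.arange(4, dtype=np.float32).reshape(2, 2, 1)
y = np.tile(x, (1, 1, 3))   # shape (2, 2, 3)

print(y.shape)   # (2, 2, 3)
print(y[1, 0])   # [2. 2. 2.]
```

For the NxNx1 inputs discussed below, the broadcast happens on the fast-moving innermost dimension, which is exactly the case where the optimized implementation gains the most.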

On modern Intel processors, it is imperative to use SIMD (Single Instruction Multiple Data) processing to achieve ideal performance. However, Intel VTune profiling revealed that the Eigen implementation of the tensor broadcast operation relies heavily on scalar (non-SIMD) instructions, leading to suboptimal performance. These scalar instructions, which involve division and modulo operations, calculate the index in the input tensor from which elements are then copied to the output tensor. In addition, we observed that excessive index calculations are performed when the tensor dimensions are not SIMD-friendly, i.e., not a multiple of the vector register width (16 elements for the FP32 data type).
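To see why the scalar path is expensive, the sketch below (a NumPy stand-in, not Eigen’s actual code) mimics how a naive per-element broadcast recovers the source index for every output element with integer division:

```python
import numpy as np

def broadcast_scalar(x, factor):
    """Naive broadcast of a (rows, cols, 1) tensor by `factor` on the unit
    dimension. For every single output element, the flat source index is
    recovered arithmetically, mirroring the scalar index calculations in
    the unoptimized path."""
    rows, cols, _ = x.shape
    src = x.ravel()
    out = np.empty(rows * cols * factor, dtype=x.dtype)
    for i in range(out.size):
        out[i] = src[i // factor]   # one division per output element; a
                                    # modulo also appears for broadcasts
                                    # on non-inner dimensions
    return out.reshape(rows, cols, factor)

x = np.arange(4, dtype=np.float32).reshape(2, 2, 1)
assert np.array_equal(broadcast_scalar(x, 3), np.tile(x, (1, 1, 3)))
```

The result is correct, but the cost of the index arithmetic grows with every output element, and none of the copies are vectorized.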

To maximize performance, we optimized the Eigen implementation of tensor broadcast using Intel® Advanced Vector Extensions 512 (Intel® AVX-512) SIMD instructions and reduced the number of index calculations needed to form the output tensor. To evaluate the impact of these optimizations, we benchmarked the Eigen tensor broadcast operation independently of TensorFlow on a single core of the Intel processor and observed performance speed-ups of 58-65x (NxNx1 inputs) and 3-4x (1xNxN inputs) over the baseline. Figure 2 shows the performance comparison for a range of tensor sizes in steps of 32, with a broadcast factor of 32 on the unit dimension.
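The core idea of the optimization, hoisting the index arithmetic out of the inner loop so that contiguous runs of elements can be copied in bulk (which a compiler can then vectorize with SIMD instructions), can be sketched as follows. Again, this is a NumPy illustration of the idea, not the upstreamed Eigen code:

```python
import numpy as np

def broadcast_blocked(x, factor):
    """Broadcast of a (rows, cols, 1) tensor by `factor` on the unit
    dimension with the index arithmetic hoisted out of the inner loop:
    one index computation per *input* element, followed by a bulk,
    SIMD-friendly fill of `factor` contiguous outputs, instead of a
    division per *output* element."""
    rows, cols, _ = x.shape
    src = x.ravel()
    out = np.empty(src.size * factor, dtype=x.dtype)
    for j in range(src.size):                       # one index per input element
        out[j * factor:(j + 1) * factor] = src[j]   # contiguous bulk fill
    return out.reshape(rows, cols, factor)

x = np.arange(4, dtype=np.float32).reshape(2, 2, 1)
assert np.array_equal(broadcast_blocked(x, 3), np.tile(x, (1, 1, 3)))
```

Compared with the per-element version, the number of index computations drops by a factor equal to the broadcast factor, and each inner fill touches a contiguous memory range, which is exactly the access pattern SIMD loads and stores need.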

Figure 2: Performance comparison of tensor broadcast operation with and without Intel optimizations.  Lower is better.

Figure 3: Performance (throughput and speed-up) comparison of Intel-optimized TFS over the baseline.  Higher is better.

Referring back to the TFS application with which we started, using the tensor broadcast optimizations on top of Intel MKL-DNN and two TFS instances resulted in an overall performance improvement of 2.5x compared to baseline TFS. Figure 3 shows the performance impact of each of the optimization steps compared to baseline.

We generalized the tensor broadcast optimizations to N-dimensional tensors with a unit-size innermost or outermost dimension and an arbitrary broadcast factor on the unit dimension, then upstreamed the code improvements to the public distribution of Eigen (available in TensorFlow release 1.10).

Intel optimizations to TensorFlow-Serving deliver significant performance gains and helped Taboola reduce the latency of its recommendation services on Intel Xeon Scalable processors. Ariel Pisetzky, Vice President of Information Technology at Taboola, praised the Intel optimizations to Taboola’s infrastructure: “Serving from the CPUs helped us reduce costs, increase efficiency, and provide more content recommendations with our existing servers.” Intel continues to improve the performance of the deep learning software stack for infrastructure teams at companies such as Taboola and other major customers. We also encourage the community to use SIMD-friendly parameters in their machine learning models for optimal performance.

As a co-sponsor of The Artificial Intelligence Conference in San Francisco from September 4-7, we look forward to showing you the latest innovations in applied AI. Intel keynotes and sessions will share practical AI use cases and provide the technical knowledge needed to help develop and implement successful AI applications across a variety of industries today. Visit us at booth #101 to see how Intel is breaking barriers between model and reality.

[1] https://www.tensorflow.org/serving/
[2] https://eigen.tuxfamily.org
[3] https://github.com/intel/mkl-dnn
[4] https://software.intel.com/en-us/intel-vtune-amplifier-xe
We acknowledge the contributions to the open source ecosystem from Huma Abidi and AG Ramesh’s team to TensorFlow, and from Craig Garland and Vadim Pirogov’s team to Intel MKL-DNN.
System Configuration:
Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz; 2 sockets, 28 cores/socket (56 cores total), Hyper-threading ON, Turbo boost OFF, CPU scaling governor “performance”;
RAM: Samsung 192 GB DDR4@2666MHz.  (16GB DIMMS x 12);
BIOS: Intel SE5C620.86B.0X.01.0007.062120172125;
Hard Disk: INTEL SSDSC2BX01 1.5TB
OS: CentOS Linux release 7.5.1804 (Core) (3.10.0-862.9.1.el7.x86_64)
Baseline TensorFlow-Serving: TensorFlow-Serving r1.9 — https://github.com/tensorflow/serving.
Intel Optimized TensorFlow-Serving: TensorFlow-Serving r1.9 + Intel MKL-DNN + Optimizations
Intel MKL-DNN: https://mirror.bazel.build/github.com/intel/mkl-dnn/archive/0c1cf54b63732e5a723c5670f66f6dfb19b64d20.tar.gz
MKLML:  https://mirror.bazel.build/github.com/intel/mkl-dnn/releases/download/v0.15/mklml_lnx_2018.0.3.20180406.tgz
Performance results are based on testing as of (08/06/2018) and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.
Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.
Intel, Xeon, and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
 *Other names and brands may be claimed as the property of others.
© Intel Corporation.