When executing inference operations, AI practitioners need an efficient way to integrate components that delivers great performance at scale while providing a simple interface between application and execution engine.
Thus far, TensorFlow* Serving has been the serving system of choice for several reasons:
There are, however, a few challenges to the successful adoption of TensorFlow Serving:
For successful adoption, an inference platform should include acceptable latency even for demanding workloads, easy integration between training and deployment systems, scalability and a standard client interface. A new model server inference platform developed by Intel, the OpenVINO™ Model Server, offers the same interface as TensorFlow Serving gRPC API but employs inference engine libraries from the Intel® Distribution of OpenVINO™ toolkit. Based on convolutional neural networks (CNN), this toolkit extends workloads across Intel® hardware (including accelerators) and maximizes performance across computer vision accelerators—CPUs, integrated GPUs, Intel Movidius VPUs, and Intel FPGAs
Intel’s Poland-based AI inference platform team compared results captured from a gRPC client run against a Docker container using a TensorFlow* Serving image from Docker Hub (tensorflow/serving:1.10.1) and a Docker image built using an OpenVINO Model Server image with Intel Distribution for OpenVINO toolkit version 2018.3. We applied standard models from TensorFlow-Slim image classification models library, specifically resnet_v1_50, resnet_v2_50, resnet_v1_152 and resnet_v2_152.
Using identical client application code and hardware configuration in the Docker containers, OpenVINO Model Server delivered up to 5x the performance of TensorFlow Serving, depending on batch size. The improved performance of the OpenVINO Model Server means that the inference interface can be easily accessible over a network, opening new opportunities for supported applications and reducing the cost, latency and power consumption.
Models trained in TensorFlow, MxNet*, Caffe*, Kaldi*, or in ONNX format are optimized using the Model Optimizer included in the OpenVINO toolkit. This process is done just once. The output of the model optimizer is two files with .xml and .bin extensions. The XML represents the optimized graph, and the bin file contains the weights. These files are loaded into the Inference Engine, which provides a lightweight API for integration into the actual runtime application.OpenVINO Model Server allows these models to be served through the same gRPC interface as TensorFlow Serving.
An automated pipeline can be easily implemented, which first trains the models in the TensorFlow framework, then exports the results in a protocol buffer file and later converts them to Intermediate Representation format. As long the model includes layer types supported by OpenVINO, there are no extra steps needed. However, for a few non-supported layers there is still a way to complete the transformation by installing appropriate extensions for the missing operations. Refer to the Model Optimizer documentation for more details.
The same conversion can be completed for Caffe and MXNet models (and the recent OpenVINO release 2018.3 also supports Kaldi and ONNX models.) As a result, the OpenVINO Model Server can become the inference execution component for all these deep learning frameworks.
The OpenVINO Model Server architecture stack is shown in Figure 5. It is implemented as a Python* service with gRPC libraries exposing the API from the TensorFlow Serving API. These are used as identical proto files, which make the API implementation fully compatible for the same clients. Therefore, no code changes are needed on the client side to connect to both serving components.
Key differences include inference execution implementation which relies on the Inference Engine API. With an optimized model format, and using Intel-optimized libraries for inference execution on CPUs, FPGAs, and VPUs, you can take advantage of significantly better performance.
The first step in enabling OpenVINO Model Server is to generate an IR model format out of the TensorFlow saved model representation using the Model Optimizer. You can use a command similar to the example below:
mo_tf.py --saved_model_dir /tf_models/resnet_v1_50 --output_dir /ir_models/resnet_v1_50/ --model_name resnet_v1_50
Model Optimizer arguments:
- Path to the Input Model: None
- Path for generated IR: /ir_models/resnet_v1_50/
- IR output name: resnet_v1_50
- Log level: ERROR
- Batch: Not specified, inherited from the model
- Input layers: Not specified, inherited from the model
- Output layers: Not specified, inherited from the model
- Input shapes: Not specified, inherited from the model
- Mean values: Not specified
- Scale values: Not specified
- Scale factor: Not specified
- Precision of IR: FP32
- Enable fusing: True
- Enable grouped convolutions fusing: True
- Move mean values to preprocess section: False
- Reverse input channels: False
TensorFlow specific parameters:
- Input model in text protobuf format: False
- Offload unsupported operations: False
- Path to model dump for TensorBoard: None
- Update the configuration file with input/output node names:
- Operations to offload: None
- Patterns to offload: None
- Use the config file: None
Model Optimizer version: 126.96.36.19935e231
[ SUCCESS ] Generated IR model.
[ SUCCESS ] XML file: /ir_models/resnet_v1_50/resnet_v1_50.xml
[ SUCCESS ] BIN file: /ir_models/resnet_v1_50/resnet_v1_50.bin
[ SUCCESS ] Total execution time: 7.75 seconds.
Before the models can be used in OpenVINO Model Server they should be placed in a folder structure similar to the one shown in Figure 6.
Each model should have a separate folder where every version is stored in subfolders with numerical names. This way, OpenVINO Model Server can handle multiple models and manage their versions in a similar manner to TensorFlow Serving.
The deployment process is limited to two steps:
docker build -t openvino-model-server:latest
docker run --rm -d -v /models/:/opt/ml:ro -p 9001:9001 openvino-model-server:latest \
/ie-serving-py/start_server.sh ie_serving model --model_path /opt/ml/model1 --model_name my_model --port 9001
After those two steps are finished, OpenVINO Model Server runs as a Docker container in detached mode and listens for gRPC inference requests on port 9001. More details about the usage and configuration is included in the GitHub repository documentation.
Below are examples of OpenVINO Model Server adoptions. More details can be found in the github repository.
The simplest usage example for the OpenVINO Model Server is with a Docker container running on a single machine.
The prerequisites for the setup are:
While the docker container is configured and launched according to this documentation, it can be used to serve inference interface for local applications or over the network.
AWS Sagemaker* is an inference solution with REST API interface and capabilities for configuring custom pre and post processing. It passes inference requests to the TensorFlow Serving component which is installed in the same docker container along with Sagemaker services.
It is possible to replace the TensorFlow Serving component with OpenVINO Model Server without changing the client code or the Sagemaker component. The replacement is mostly transparent and the only needed modifications are in an updated Docker file for building the Sagemaker component and a minor code update that injects a command for starting OpenVINO Model Server instead of TensorFlow Serving. The example code is present in the OpenVINO Model Server source code repository.
You can take advantage of the capabilities of AWS Sagemaker while improving the performance and reducing the response latency.
OpenVINO Model Server can be easily deployed in a Kubernetes* environment. Such a configuration enables new capabilities due to its scalability and high availability. Each instance of OpenVINO Model Server is represented by a Kubernetes pod attached to a service exposed via nginx ingress.
Inference operations are stateless, which makes the infrastructure easy to scale horizontally up and down according to user demand. AI models need to be mounted inside the volumes stored in NAS solutions like NFS, CEPH, S3 or others supported by Kubernetes.
To summarize, OpenVINO Model Server has multiple benefits for data scientists and inference consumers:
The Intel AI inference platform team would like to thank the GE Healthcare team for their collaboration in designing and testing OpenVINO Model Server integration with AWS Sagemaker and providing valuable feedback about performance results. We would also like to thank Prashant Shah and the Intel AI business development team for their collaboration and partnership, the Intel AI benchmarking team in Poland for help in executing performance tests on multiple configurations and hardware, and the Intel OpenVINO development team in Nizhny for assistance and numerous consultations. These contributions enabled Marek Lewandowski and the AIPG inference platform team to design and develop OpenVINO Model Server in a short timeframe.