This article introduces software-level and deep neural network (DNN) architecture-level optimizations and tweaks for achieving high throughput in deep-learning-based object detection applications running on FPGA-equipped edge platforms. The ideas presented here could be generalized to speed up other compute-intensive applications on edge platforms as well.
- Overview of Edge Platforms
- Introduction to Object-Detection using Deep Learning
- Edge Specialized Hardware Solutions and FPGA
- Solution Optimization/Tuning Techniques
1. Overview of the Edge Environment
What is ‘Edge’?
Edge devices sit on the fringes (the edge) of the cloud and provide a gateway through which data enters the cloud service. They include platforms that sit ‘on premise’, connected directly to the source of data (usually some kind of sensor) on one side and to a cloud service on the other. Beyond their relative location in the cloud infrastructure, edge compute platforms serve the more important purpose of running local analytics on the input data and acting upon the results themselves before optionally forwarding the collected data to the cloud service.
Given their specialized function of analyzing data, potentially in real time, they are usually most successful when fitted with specialized hardware for efficient performance. Because they sit close to the data source, they also often have to work within a restricted power supply and thermal envelope.
Why do we need Edge platforms?
Edge platforms prove to be extremely valuable in the following situations:
- Slow or unreliable connectivity to cloud: certain situations require real-time analysis of, and corresponding responses to, sensor data. A slow or unreliable network connection would severely impede the proper functioning of such systems.
For example, autonomous driving requires real-time responses to sensor data indicating a hazardous situation. Waiting for the sensor data to reach the cloud for analysis, and for the subsequent decisions to be delivered back, could prove disastrous.
- Too much data is collected: even in situations that may not need strict real-time responses, sending huge quantities of data to the cloud may be impractical or too expensive. For example, a system of surveillance cameras could collect a huge number of raw image frames. Pumping all of this data to the cloud could be impractical or wasteful if only a few of these frames contain any objects of interest.
- Privacy and security of customer data: some of the data collected might be highly sensitive, and its confidentiality could be essential to the customer’s business. In such cases, avoiding transmission of this data over the internet greatly decreases the attack surface, thereby providing better security.
2. Introduction to Object Detection
Multi-class Object Detection
We present a particular case study of an application tasked with video surveillance and analytics on the edge. The application would need to parse incoming raw video frames from surveillance cameras connected to it and be able to identify and locate objects of interest in them.
Multi-class object detection is commonly used in digital surveillance, security, and autonomous driving applications. Published techniques for solving this problem already exist, such as Fast R-CNN (Girshick), YOLO (Redmon et al.), and SSD (Liu et al.).
Image with bounding boxes. Source: Liu et al.
This problem of detecting object classes within an image is not new, and the computer vision and machine learning community has already produced effective solutions. However, deploying one as a real-time edge solution imposes many constraints:
- Real-time or better performance: the system must be able to process video streams from multiple surveillance cameras, so its analytics throughput must be at least the sum of the individual cameras’ throughputs.
- Low power consumption
- Variable input resolutions: the system must maintain fair accuracy across a range of input camera frame resolutions.
- Maintain a ‘reasonable’ accuracy: cannot sacrifice accuracy for performance.
- Deployable on low-memory platforms: Edge platforms often have low memory to meet form-factor, power, and thermal requirements. The solution must work fairly well with such platforms.
- The solution must be easy to adapt to an arbitrary number of detection classes
In summary, there should be a fair balance between performance, accuracy, memory and power consumption.
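The throughput constraint above can be made concrete with a little arithmetic. The following is a minimal sketch; the camera frame rates and the measured throughput are hypothetical values chosen only for illustration, and the function names are not part of any SDK.

```python
def required_throughput_fps(camera_fps):
    """Minimum analytics throughput: the sum of all camera frame rates."""
    return sum(camera_fps)


def meets_real_time(measured_fps, camera_fps):
    """True if the measured inference throughput keeps up with every camera."""
    return measured_fps >= required_throughput_fps(camera_fps)


# Hypothetical deployment: four cameras with different frame rates.
cameras = [30, 30, 15, 25]
needed = required_throughput_fps(cameras)
print(f"required: {needed} frames/s")
print("keeps up at 120 frames/s:", meets_real_time(120, cameras))
print("keeps up at 90 frames/s:", meets_real_time(90, cameras))
```

Any optimization discussed later can be judged against this budget: a model that falls short of the summed frame rate will drop frames regardless of its accuracy.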
3. Edge Specialized Hardware Solutions and FPGA
The FPGA Technology
- Field Programmable Gate Array
- Composed of digital logic elements that can be (re)programmed ‘in the field’
- Relatively easy to design complex digital logic
- Extensive selection of design options, from hardware description languages to high-level languages
Intel® Arria® 10 FPGA
The Intel® Arria® 10 FPGA is compatible with the OpenCL* interface. The logic of the FPGA can be specified as OpenCL* kernels, and the application kernels can be deployed and controlled at runtime from a host CPU using OpenCL* calls. The figure below shows the control flow of provisioning and running applications on the Intel® Arria® 10 FPGA.
*OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
4. Optimization and Tuning Techniques
Software stack optimized to minimize memory transfers
- Intel® Deep Learning SDK is a design stack for running CNN inference on Intel FPGAs.
- Its architecture caches input and output feature maps and convolution filters in on-chip memory.
- It defines a double-buffered stream buffer architecture that feeds feature map elements to the processing elements.
- The processing elements operate on vectorized chunks of feature map data to speed up computation.
DLA stream buffer architecture. Source: Aydonat et al.
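The double-buffering idea can be illustrated in software. The sketch below is only one plausible reading of the scheme, not the SDK's implementation: two buffers swap roles after every layer, so each layer's output becomes the next layer's input without leaving the (simulated) on-chip memory, and data is handed to the layer in fixed-width chunks. The names `run_layers` and `vec_width` are hypothetical.

```python
def run_layers(input_map, layers, vec_width=4):
    """Run layers in sequence using two buffers that swap roles each layer.

    Each layer is a function that takes a chunk (list) of feature map
    elements and returns a list of the same length.
    """
    buffers = [list(input_map), []]  # buffers[read_idx] holds the current input
    read_idx = 0
    for layer in layers:
        write_idx = 1 - read_idx
        source = buffers[read_idx]
        output = []
        # Feed the layer one vectorized chunk at a time.
        for start in range(0, len(source), vec_width):
            output.extend(layer(source[start:start + vec_width]))
        buffers[write_idx] = output
        read_idx = write_idx  # swap: this output is the next layer's input
    return buffers[read_idx]


# Two toy "layers" standing in for real convolutions.
add_one = lambda chunk: [x + 1 for x in chunk]
double = lambda chunk: [2 * x for x in chunk]
print(run_layers(range(6), [add_one, double]))
```

The point of the swap is that intermediate feature maps never have to be written out and fetched again between layers, which is what makes fitting them into the buffer so important.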
Replace the VGG16 classifier backbone with the smaller SqueezeNet
- VGG16 is the backend classifier used in SSD300.
- Replace VGG16 (Simonyan et al.) with the faster SqueezeNet (Iandola et al.) as the backend classifier.
- Replace the convolution layers in SSD with 7 Fire layers similar to the ones in SqueezeNet.
- Overall size reduction from ~103 MB to ~25 MB
Additional SSD layers shown with a VGG-16 backend. Source: Liu et al.
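To see why Fire layers shrink the model, it helps to count parameters. A Fire layer, as described in the SqueezeNet paper, is a 1x1 "squeeze" convolution followed by parallel 1x1 and 3x3 "expand" convolutions. The sketch below compares one Fire layer with a plain 3x3 convolution that has the same input and output channel counts. The channel sizes are assumed for illustration (they follow the first Fire module of SqueezeNet as I understand it) and are not the sizes used in the detector described here.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights plus biases of a kernel x kernel convolution layer."""
    return kernel * kernel * in_ch * out_ch + out_ch


def fire_params(in_ch, squeeze, expand1x1, expand3x3):
    """Parameters of a Fire layer: 1x1 squeeze, then 1x1 and 3x3 expand."""
    return (conv_params(in_ch, squeeze, 1)
            + conv_params(squeeze, expand1x1, 1)
            + conv_params(squeeze, expand3x3, 3))


# Assumed sizes: 96 input channels, 16 squeeze filters, 64 + 64 expand filters.
fire = fire_params(96, 16, 64, 64)
plain = conv_params(96, 128, 3)  # same 96 -> 128 channels with a plain 3x3 conv
print(f"Fire layer: {fire} parameters")
print(f"Plain 3x3 convolution: {plain} parameters")
print(f"Reduction: {plain / fire:.1f}x")
```

Most of the saving comes from the squeeze step: the expensive 3x3 filters see only 16 input channels instead of 96.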
Reduce input resolution and bypass earlier feature maps to the mbox layers
- Change the input data resolution from 300x300x3 to 224x224x3
- Following this change, accuracy loss needs to be limited by passing earlier feature maps on to the detection layers
- This is achieved using ResNet-like connections (He et al.) that bypass some intermediary layers and allow earlier feature maps to be concatenated to later ones
- In some cases, this largely retains accuracy without a performance hit.
- Using bypass connections also addresses vanishing-gradient issues arising from the extra Fire layers in the detector.
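The saving from the lower input resolution can be estimated directly. The sketch below only compares pixel counts. Since the work done by a convolution layer scales with the spatial size of its output, the ratio is a rough indication of the saving in the early layers, not a measured speedup.

```python
def pixel_count(height, width):
    """Number of spatial positions in one input frame."""
    return height * width


original = pixel_count(300, 300)
reduced = pixel_count(224, 224)
ratio = reduced / original

print(f"300x300 input: {original} pixels")
print(f"224x224 input: {reduced} pixels")
print(f"The reduced input has {100 * (1 - ratio):.0f}% fewer pixels")
```

Fewer input pixels also means coarser feature maps deeper in the network, which is why the earlier, higher-resolution feature maps are routed to the detection layers.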
Reduce intermediate feature map sizes to ‘fit’ into the stream buffer
- Given that the stream buffer is used to cache the feature maps, it is imperative for performance that the input and output feature maps fit wholly inside the stream buffer.
- One of the feature maps early in the network exceeded the stream buffer size. This degraded performance, because un-cached data had to be fetched from external memory during convolution execution.
- Hence, those particular feature maps were trimmed by using larger convolution filters with larger strides, and fewer of those larger filters.
- This size reduction caused only minimal accuracy loss in some cases, although the effect is heavily case dependent
- This change could also affect training time
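The effect of filter size, stride, and filter count on feature map size follows from the standard convolution output-size formula. The sketch below is illustrative only: the buffer capacity and both layer configurations are hypothetical, not the values used in this work.

```python
def conv_output_size(in_size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (in_size + 2 * padding - kernel) // stride + 1


def feature_map_elements(in_size, kernel, stride, num_filters, padding=0):
    """Number of elements in the output feature map of a square convolution."""
    out = conv_output_size(in_size, kernel, stride, padding)
    return out * out * num_filters


STREAM_BUFFER_ELEMENTS = 500_000  # hypothetical capacity, for illustration only

# Hypothetical first layer before and after trimming.
before = feature_map_elements(224, kernel=3, stride=2, num_filters=64)
after = feature_map_elements(224, kernel=7, stride=4, num_filters=32)

for label, size in (("3x3, stride 2, 64 filters", before),
                    ("7x7, stride 4, 32 filters", after)):
    fits = size <= STREAM_BUFFER_ELEMENTS
    print(f"{label}: {size} elements, fits in buffer: {fits}")
```

A larger stride shrinks both spatial dimensions at once, so it reduces the feature map faster than cutting the filter count alone.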
Augment with an efficient media pipeline
- Edge inference often relies on data captured and pre-processed in real time
- ‘Pipelining’ the data processing stages with inference lets the stages overlap in time, which relaxes the performance constraints on inference and leaves room for a more accurate model
- Hardware accelerated media processing is particularly beneficial
- Solutions like Intel® Media SDK provide hardware accelerated media processing implementations.
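The pipelining idea can be sketched with two threads connected by a bounded queue. This is a generic illustration, not the Intel® Media SDK API: `run_pipeline`, `preprocess`, and `infer` are hypothetical stand-ins for the decode/pre-processing stage and the inference stage.

```python
import queue
import threading


def run_pipeline(frames, preprocess, infer, depth=4):
    """Overlap pre-processing and inference using a bounded queue."""
    handoff = queue.Queue(maxsize=depth)  # bounded, so memory use stays capped
    results = []
    done = object()  # sentinel marking the end of the stream

    def producer():
        for frame in frames:
            handoff.put(preprocess(frame))
        handoff.put(done)

    def consumer():
        while True:
            item = handoff.get()
            if item is done:
                break
            results.append(infer(item))

    workers = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


# Toy stages standing in for real decode and inference calls.
print(run_pipeline(range(5), preprocess=lambda f: f * 2, infer=lambda x: x + 1))
```

In pure Python the two stages only truly overlap when they spend their time waiting on hardware or in native code, which is the situation with hardware-accelerated media processing and FPGA inference.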
5. Summary and Key Takeaways
- Intelligent Edge analytics is vital to an increasingly smart, connected world.
- The Edge needs software and topology optimizations custom tailored for specialized hardware.
- The versatility of FPGAs makes them excellent Edge platform components.
- Tuning the complete hardware and software stack for the use case is vital for performance.
- Pre-optimized components greatly simplify this
Links and References
[1] Bai, Yu; Alawad, Mohammed; DeMara, Ronald F.; Lin, Mingjie (2015). Optimally Fortifying Logic Reliability through Criticality Ranking. Electronics 4, 150-172. doi:10.3390/electronics4010150
[2] Girshick, Ross. Fast R-CNN. arXiv:1504.08083 [cs.CV]
[3] Redmon, Joseph; Divvala, Santosh; Girshick, Ross; Farhadi, Ali. You Only Look Once: Unified, Real-Time Object Detection. arXiv:1506.02640 [cs.CV]
[4] Liu et al. SSD: Single Shot MultiBox Detector. arXiv:1512.02325 [cs.CV]
[5] Aydonat et al. An OpenCL(TM) Deep Learning Accelerator on Arria 10. arXiv:1701.03534 [cs.DC]
[6] Simonyan et al. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs.CV]
[7] Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360 [cs.CV]
[8] He et al. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs.CV]
Intel, the Intel logo, Intel Inside, the Intel Inside logo and Arria are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.