Enhancing Motor Sports Viewing Experience with Fine-Grained Object Detection

Monday night at CES, Intel announced a partnership with the Ferrari Challenge North America Series* that will enhance the race viewing experience with the help of artificial intelligence technologies from Intel.

The latest AI technologies enable data analysis enhancements that will benefit motorsports viewers. For example, an AI model designed for fine-grained object detection and localization could give viewers the ability to select the specific cars and drivers they want to focus on, and automatically switch between cameras to keep those drivers in view. An AI solution could also analyze the video feeds, detect exciting events on the track, and automatically switch feeds to highlight the action. AI-powered video analytics could likewise enable offline video cataloging and event statistics collection. There are many more potential use cases, and each one must be connected with state-of-the-art AI models designed to address the domain-specific challenges, then built end-to-end and optimized using AI tools.

Ferrari Challenge and Data Science

The Intel AI Lab team first developed a model to detect and classify the main objects of interest within the race broadcast, the cars, using footage from live drone feeds. We saw this as a starting point for our AI solutions: once we could determine the relative locations of cars on the track and which cars were viewable in each live drone feed, we could build additional analysis tools, such as detecting and logging interesting events like one car passing another.
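As a rough illustration of that last idea, a pass can be logged whenever two cars swap order between consecutive frames. The sketch below assumes an upstream detector already yields each car's progress along the track (e.g. cumulative distance); the function name and data are hypothetical, not part of the actual pipeline.

```python
# Hypothetical sketch: logging "pass" events from per-frame car positions.
# Assumes progress is a dict mapping car number -> distance along the track.

def detect_passes(prev_progress, curr_progress):
    """Return (overtaker, overtaken) pairs whose track order swapped
    between two consecutive frames."""
    prev_order = sorted(prev_progress, key=prev_progress.get, reverse=True)
    curr_order = sorted(curr_progress, key=curr_progress.get, reverse=True)
    prev_rank = {car: i for i, car in enumerate(prev_order)}
    curr_rank = {car: i for i, car in enumerate(curr_order)}
    events = []
    for car in curr_rank:
        for other in curr_rank:
            if (prev_rank[car] > prev_rank[other]
                    and curr_rank[car] < curr_rank[other]):
                events.append((car, other))
    return events

# Car 24 moves ahead of car 8 between two frames:
print(detect_passes({"8": 120.0, "24": 118.5},
                    {"8": 121.0, "24": 121.6}))  # → [('24', '8')]
```

In practice the per-frame positions would come from the detector's bounding boxes projected onto the track layout, and events would be debounced over several frames to avoid flicker.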

Throughout the 2017 season of the Ferrari Challenge North America, drone footage of each race was simultaneously captured by up to 6 drones. The drone feeds provided a large quantity of video data for training and testing and also revealed the challenges specific to this use case.

Most state-of-the-art object localization models are developed using a public benchmark dataset, such as PASCAL VOC[1], that contains standardized images for object class recognition and localization. Although the PASCAL VOC dataset includes natural images with overlapping objects and a variety of shapes, sizes, and viewing angles, its 20 classes, such as airplanes, dogs, and trains, are visually very distinct. Our Ferrari race dataset, on the other hand, was composed of around 50 identically shaped cars (Ferrari 458 Challenge and Ferrari 488 Challenge), distinguishable only by their exterior colors and designs. In machine learning this is typically called fine-grained object classification, a more challenging problem that requires extra care in designing the training process.

In PASCAL VOC, and in traditional car-detection datasets generally, images are taken from the dashboard of a moving vehicle, yielding relatively large profile-view and rear-view car images. Most state-of-the-art algorithms have hyperparameters that have proven able to handle that type of visual data. In the aerial drone footage from the Ferrari Challenge, however, we faced classification challenges due to the relatively small size of the cars in the images and the unusual viewpoint; a standard car detector would need to be retrained to perform acceptably on this task. An example frame is pictured below.

To achieve broadcast quality driver detection on a live stream, we needed to combine novel data collection, custom data pre-processing, and careful model tuning to overcome these challenges and deliver a working model.

We decided to gather a new dataset specifically for driver detection based on the available footage. The frames were extracted from 4K drone footage of the races, and were selected to have the greatest variety in lighting, car sizes, shot angles, and track conditions. This variety partially helped to overcome the challenge of small car sizes and unusual viewpoints.

With this domain-specific training set, we also needed to design the data pre-processing steps to handle the high-resolution images. Normally, for object localization models, full-sized images are downsampled and resized to 512×512. For our use case, however, this would result in cars that were only a few pixels wide and completely indistinguishable. By breaking each image into a set of 512×512 non-overlapping patches, we were able to retain the full resolution of the smaller cars while also increasing their relative size in the input, dramatically improving fine-grained classification performance.
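The patching step can be sketched in a few lines of numpy. The 512×512 patch size follows the text; the 3840×2160 UHD frame dimensions, function name, and the choice to drop ragged borders are our own illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the patching step: split a full-resolution frame into
# non-overlapping 512x512 tiles so small cars keep their native pixels.

def to_patches(frame, patch=512):
    """Split an HxWxC frame into (y, x, tile) triples,
    dropping any ragged border smaller than the patch size."""
    h, w = frame.shape[:2]
    tiles = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            tiles.append((y, x, frame[y:y + patch, x:x + patch]))
    return tiles

frame = np.zeros((2160, 3840, 3), dtype=np.uint8)  # 4K UHD frame
tiles = to_patches(frame)
print(len(tiles))  # 4 rows x 7 cols = 28 full tiles
```

The (y, x) offsets are kept with each tile so that per-patch detections can be mapped back into full-frame coordinates afterwards.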

Another crucial step was data augmentation during training. The aeon[2] data loader from Intel provides many built-in optimized functionalities to perform random cropping, resizing, mirroring, and reshaping of images in the dataset as they are loaded. These augmentations critically helped with the challenge of classifying small objects and the wide variety of viewing angles.
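To make those transforms concrete, here is a plain-numpy stand-in for random cropping and mirroring. This is not aeon's API (aeon applies such transforms internally as images are loaded); it is only a sketch of the operations involved, with an illustrative crop size.

```python
import numpy as np

# Illustrative stand-in for load-time augmentation: a random crop
# followed by a coin-flip horizontal mirror.

rng = np.random.default_rng(0)

def augment(img, crop=448):
    h, w = img.shape[:2]
    y = rng.integers(0, h - crop + 1)   # random top-left corner
    x = rng.integers(0, w - crop + 1)
    out = img[y:y + crop, x:x + crop]
    if rng.random() < 0.5:              # mirror half the time
        out = out[:, ::-1]
    return out

patch = np.zeros((512, 512, 3), dtype=np.uint8)
print(augment(patch).shape)  # (448, 448, 3)
```

One caveat specific to detection: the ground-truth bounding boxes must be shifted and mirrored in lockstep with the pixels, which the data loader handles alongside the image transforms.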

In the neon™ deep learning framework, there are several object localization models available to developers as starting models, including the Single Shot Multibox Detector[3][5] algorithm (SSD) and Faster-RCNN[4][6].

We decided to use SSD as our starting model. Since it uses a single network for both classification and detection, it enables significantly faster inference than comparable algorithms. Furthermore, its use of feature maps at multiple resolutions gives it much higher detection accuracy for objects of all scales, as desired for our highly dynamic drone footage.
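SSD's multi-scale behavior comes from attaching default boxes of increasing scale to successively coarser feature maps. Following the linear scale schedule from the SSD paper[3], with its suggested s_min and s_max values (the exact values used for this model are a tuning choice), the schedule looks like:

```python
# Default-box scales for m feature maps, interpolated linearly between
# s_min and s_max as in the SSD paper; finer maps get smaller boxes.

def ssd_scales(m=6, s_min=0.2, s_max=0.9):
    return [round(s_min + (s_max - s_min) * k / (m - 1), 3) for k in range(m)]

print(ssd_scales())  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

The finest feature map, with scale 0.2, is what catches the small drone-view cars, while the coarsest maps handle close fly-by shots where a car fills much of the frame.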

As in many of our past data science projects, many of the standard models' default hyperparameters needed to be re-tuned for the new dataset. We searched over learning rate schedules, batch sizes, non-maximum suppression (NMS) thresholds, and patch sizes, among other settings.
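Among those knobs, the NMS IoU threshold directly controls how aggressively overlapping detections are merged, which matters when cars run nose-to-tail. A minimal numpy version of the standard algorithm (the 0.45 threshold here is illustrative, not the tuned value):

```python
import numpy as np

# Greedy non-maximum suppression: repeatedly keep the highest-scoring box
# and drop any remaining box whose IoU with it exceeds the threshold.

def nms(boxes, scores, iou_thresh=0.45):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; returns kept indices."""
    order = np.argsort(scores)[::-1]        # best score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = ((boxes[order[1:], 2] - boxes[order[1:], 0])
                  * (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + area_o - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # → [0, 2]: the overlapping pair collapses to one
```

A lower threshold merges near-duplicate boxes more aggressively but risks suppressing a genuinely separate car drafting close behind another, which is why this value was worth tuning on race footage.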

The images below demonstrate the performance of the final model on held-out test images. As can be seen, the model is robust to changes in lighting, angle, orientation, and size, and can even classify partially occluded cars correctly. The final annotations produced by the model include a bounding box around each car of interest, the car number, and the associated color scheme of the car.

 

Next Steps

While state-of-the-art algorithms are a good starting point for building practical applications out of object localization networks, equally important engineering and data science work is needed to adapt the algorithm to such a fine-grained scale and challenging dataset. Furthermore, we need tools such as aeon and neon to experiment with ideas and refine our models. Moving forward, we are using this same workflow to improve the performance of the detection model, add more advanced event detection, enable extra driver statistics collection, and combine information from other potential data sources. Adding these functionalities was just an exciting beginning. Introducing the analytic capabilities of state-of-the-art AI algorithms to motorsports will enhance the viewing experience in many ways and possibly uncover previously hidden insights in this amazing sport.

 

*Ferrari trademarks, the copyright in the images of the Ferrari Challenge car, and any race data are owned by Ferrari S.p.A.
[1] http://host.robots.ox.ac.uk/pascal/VOC/
[2] http://aeon.nervanasys.com
[3] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. Berg. SSD: Single Shot MultiBox Detector. https://arxiv.org/abs/1512.02325, 2015
[4] S. Ren, K. He, R. Girshick, and J. Sun.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. https://arxiv.org/abs/1506.01497, 2015
[5] https://github.com/NervanaSystems/neon/tree/master/examples/ssd
[6] https://github.com/NervanaSystems/neon/tree/master/examples/faster-rcnn