Still not slowing down: Benchmarking optimized Winograd implementations

Jul 25, 2016

By: Scott Gray and Urs Köster

This is part 3 of a series of posts on using the Winograd algorithm to make convolutional networks faster than ever before. In the second part we provided a technical overview of how the algorithm works. Since the first set of Winograd kernels appeared in neon, which we described in the first part of this series, the cuDNN team has independently implemented their own Winograd kernels for the 2×2 and 4×4 tile sizes. Here we compare the two implementations of these algorithms and describe how we optimized specifically for the types of layers encountered in different networks.


All benchmarks were performed on a TitanX GPU with the clock rate fixed at 1 GHz for a fair comparison. The benchmarks compare neon 1.5 against cuDNN v5.1. In neon, the faster of the 2×2 and 4×4 tile size Winograd algorithms is used, and for cuDNN we used cudnnFindConvolution to select the fastest algorithm. All benchmarks use the fp32 data format; results in fp16 are very similar.

To measure performance we use units of algorithmic speedup. A value of one represents the maximum possible speed of a conventional direct convolution approach running at full utilization. Values above one indicate how efficiently a faster algorithm is implemented. The TitanX has 3072 CUDA cores, each able to issue one fused multiply-add (2 FLOPs) per cycle, for a peak of 6144 GFLOPS at 1 GHz. An algorithmic speedup of 2x therefore indicates an effective performance of 12,288 GFLOPS on convolutions.
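The arithmetic behind these units can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and not part of neon or cuDNN. The last helper adds a ceiling the post itself does not state, namely the reduction in multiplications for an F(m×m, 3×3) Winograd tile, which ignores all transform costs.

```python
def peak_gflops(cuda_cores: int, clock_ghz: float, flops_per_cycle: int = 2) -> float:
    """Peak throughput in GFLOPS; one fused multiply-add counts as 2 FLOPs."""
    return cuda_cores * clock_ghz * flops_per_cycle


def effective_gflops(algorithmic_speedup: float, peak: float) -> float:
    """Direct-convolution FLOPs delivered per second at a given algorithmic speedup."""
    return algorithmic_speedup * peak


def winograd_multiply_reduction(tile: int, filter_size: int = 3) -> float:
    """Multiplies needed by direct convolution divided by those needed by
    Winograd F(tile x tile, filter x filter), ignoring transform costs."""
    direct = (tile * filter_size) ** 2
    winograd = (tile + filter_size - 1) ** 2
    return direct / winograd


# TitanX at the fixed 1 GHz clock used for the benchmarks.
titanx_peak = peak_gflops(cuda_cores=3072, clock_ghz=1.0)
print(titanx_peak)
print(effective_gflops(2.0, titanx_peak))
print(winograd_multiply_reduction(2), winograd_multiply_reduction(4))
```

Under these assumptions, the 2×2 and 4×4 tile sizes have theoretical ceilings of 2.25x and 4x respectively. A measured algorithmic speedup shows how close an implementation gets once transforms and GEMM efficiency are included.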

The benchmarks below focus on the 3×3 layers of the ImageNet models.  The only exception is that we exclude the first layer of VGG, which is usually faster to compute with direct convolution.  Independent benchmarks for entire networks are available at Soumith’s Convnet benchmarks.


The 19-layer VGG model has the widest convolutional layers of the models compared here and yields the largest Winograd speedup. Across the board, our Winograd implementation (blue) is significantly faster than the cuDNN implementation (green). The high performance is possible because the external transforms are very well amortized by the time spent in the batched GEMM operation. Since this network consists almost exclusively of 3×3 layers, the speedup shown here is very close to the end-to-end training speedup for the full network.

Figure 1: Speed benchmarks for a 19-layer VGG model. Nervana Winograd is displayed in blue. cuDNN is displayed in green. Larger is better. Nervana is consistently faster than cuDNN for all batch sizes.

Deep Residual Network

The final network is the current ImageNet winner by the MSRA team, the 152-layer residual network. We benchmark the 3×3 stride 1 layers, where our implementation again outperforms cuDNN by a factor of 2. Note that this network has 36 layers of size 14×14, which could be replaced by 16×16 layers without much change to the speed of the network.


Figure 2: Speed benchmarks for MSRA model. Nervana Winograd is displayed in blue. cuDNN is displayed in green. Larger is better. Nervana is considerably faster than cuDNN, particularly at smaller batch sizes.

Performance of fprop, bprop and update

Here we choose one network (VGG) and break down the performance by the three convolution operations: fprop, bprop and update.  Fprop and bprop use the same underlying kernel, so performance is generally the same.  Performance only diverges when the input and output feature map depths are unequal.  The update operation has to transform two large tensors, the image and the delta, making it harder to optimize.  The transform costs are higher, and harder to amortize.  Furthermore, when decreasing the mini-batch size, performance holds steady, likely because the super-tile overlap doesn’t have to be computed in the batched GEMM.

For cuDNN, the performance for update is much lower than that of fprop and bprop. Nervana's update transforms integrate the transpose operation that update requires in addition to the transform, so there is no additional overhead. Furthermore, all Winograd external transforms run at full device memory bandwidth.

In cuDNN, the update operation is relatively slow even at larger batch sizes, as the underlying GEMM implementation is not well optimized for the size of batched GEMM required. In contrast, Nervana's implementation is very efficient at the small tile size operations. Table 1 shows a detailed timing breakdown.


Figure 3: Detailed speed benchmarks for fprop, bprop and update on the VGG model. Nervana in blue, cuDNN in green.

Operation                  cuDNN      Nervana
Input (data) transform     18.7 ms    17.9 ms
Filter (delta) transform   20.4 ms    17.7 ms
GEMM                       424.7 ms   147.7 ms
Output transform           1.9 ms     0 ms (fused)


Table 1: Detailed timing example: this table shows the breakdown for the VGG layer 4.2 update operation. Timings are for 10 calls.
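To make the breakdown easier to read, the short Python sketch below totals the Table 1 timings and computes the ratios between the two implementations. The numbers are copied from the table. The variable names are illustrative only.

```python
# Table 1 timings in milliseconds for 10 calls: (cuDNN, Nervana).
timings_ms = {
    "input (data) transform": (18.7, 17.9),
    "filter (delta) transform": (20.4, 17.7),
    "GEMM": (424.7, 147.7),
    "output transform": (1.9, 0.0),  # fused into the GEMM kernel for Nervana
}

cudnn_total = sum(cudnn for cudnn, _ in timings_ms.values())
nervana_total = sum(nervana for _, nervana in timings_ms.values())

# Ratio of total time, and of the GEMM step alone.
overall_ratio = cudnn_total / nervana_total
gemm_ratio = timings_ms["GEMM"][0] / timings_ms["GEMM"][1]

# Fraction of each implementation's total time spent in the batched GEMM.
cudnn_gemm_share = timings_ms["GEMM"][0] / cudnn_total
nervana_gemm_share = timings_ms["GEMM"][1] / nervana_total

print(f"totals: {cudnn_total:.1f} ms vs {nervana_total:.1f} ms")
print(f"overall ratio: {overall_ratio:.2f}, GEMM ratio: {gemm_ratio:.2f}")
print(f"GEMM share: {cudnn_gemm_share:.0%} vs {nervana_gemm_share:.0%}")
```

The totals come to roughly 465.7 ms for cuDNN and 183.3 ms for Nervana, a ratio of about 2.5x. The transform times are nearly equal, so almost all of the gap comes from the batched GEMM.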

Workspace Size for External Transforms

The extra scratch space required for storing transform outputs can be computed in a similar way to algorithmic speedup. The input transform replaces each 4×4 block with a 6×6 block, for a memory increase of 2.25x. The filter transform takes the 3×3 points of the filter and expands them to 6×6, for a net increase of 4x. However, the filters are typically much smaller than the image and delta tensors and amount to rounding error in workspace size. The update operation needs 2 image-sized tensors, for a total of 4.5x. As this is the most that will be encountered, the total workspace is 4.5x the size of the largest network layer. This typically amounts to much less than what cuDNN FFT requires.

Still, it pays to be mindful of layers with very large feature maps, like the first few layers in VGG. Here the image and delta sizes can be enormous (224*224*64*64*4 bytes ≈ 800 MB), which can necessitate using a fused 2×2 transform or direct convolution instead. As modern network architectures tend to use striding on the first layer, workspace requirements tend to be more moderate.
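The workspace arithmetic can be reproduced with a short Python sketch. The helper names are hypothetical. The reading of 224*224*64*64*4 as height × width × feature maps × mini-batch × bytes per fp32 element is an assumption: the post gives only the product.

```python
def transform_expansion(points_in: int, points_out: int) -> float:
    """Memory growth when a points_in x points_in block becomes points_out x points_out."""
    return (points_out ** 2) / (points_in ** 2)


def tensor_bytes(height: int, width: int, channels: int, batch: int,
                 bytes_per_element: int = 4) -> int:
    """Size of a dense activation tensor in bytes (fp32 by default)."""
    return height * width * channels * batch * bytes_per_element


input_factor = transform_expansion(4, 6)   # 4x4 block -> 6x6 block
filter_factor = transform_expansion(3, 6)  # 3x3 filter -> 6x6 block
update_factor = 2 * input_factor           # image and delta are both transformed

# Assumed interpretation: 224x224 maps, 64 feature maps, mini-batch of 64, fp32.
vgg_first_layer = tensor_bytes(224, 224, 64, 64)
vgg_first_layer_mib = vgg_first_layer / 2**20
update_workspace_mib = update_factor * vgg_first_layer_mib

print(input_factor, filter_factor, update_factor)
print(vgg_first_layer, vgg_first_layer_mib, update_workspace_mib)
```

Under that assumed interpretation, the tensor is 822,083,584 bytes (784 MiB), in line with the rounded 800 MB figure. A 4.5x update workspace for it would come to about 3.4 GiB, which is why a fused 2×2 transform or direct convolution can be the better choice for such layers.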


As of the neon 1.3 release, Winograd kernels have been in daily use for training networks in neon and the Nervana Cloud. Automatic tuning is used to select the optimum kernel for each layer: direct convolution, 2×2 Winograd, or 4×4 Winograd. The kernels are deterministic and numerically accurate. Nervana is continuing to push the envelope and steadily adding improvements for small mini-batches and more filter sizes, helping the community to build the fastest performing deep learning models.

To request more information about Nervana, neon, or the Nervana Cloud, please contact us.

