author: biofizzatreya
score: 8 / 10
The paper combines convolutional networks, and the training practices that work well for CNNs, with vision transformers. While transformers generalize well over large datasets, they have certain problems: plain self-attention discards positional information (it is permutation-invariant), and since most images are locally similar, vision transformers need more data to train. Moreover, due to the quadratic complexity of the self-attention matrix, transformers have difficulty dealing with large images. LeViT's solution to these problems is to pass the image through several convolution layers first and feed the resulting feature map to the transformer, instead of feeding the raw image directly.
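To make the quadratic-cost point concrete, here is a small back-of-the-envelope calculation (my own illustration, not from the paper): with a ViT-style 16x16 patchification, the number of tokens, and hence the size of each attention matrix, grows quadratically with image resolution.

```python
# Illustration only: how the self-attention matrix grows with image size,
# assuming 16x16 patches (numbers are not from the paper).
for side in (224, 448, 896):
    tokens = (side // 16) ** 2   # number of patch tokens
    attn_entries = tokens ** 2   # entries in one attention matrix
    print(f"{side}x{side} image -> {tokens} tokens -> {attn_entries:,} attention entries")
# 224x224 ->  196 tokens ->    38,416 attention entries
# 448x448 ->  784 tokens ->   614,656 attention entries
# 896x896 -> 3136 tokens -> 9,834,496 attention entries
```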
LeViT design principles (several of these are illustrated in the sketch after this list):
- A small convnet applied to the input of the transformer
- Two output heads
- Hardswish activations
- Batch-norm after every convolution
- Shrinking attention blocks that progressively reduce the resolution of the activation map
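A minimal PyTorch sketch of what such a convolutional patch-embedding stem can look like, combining the conv-stem, batch-norm-after-every-conv, and Hardswish choices. The channel progression and the final 14x14 token grid follow the LeViT style, but treat the exact numbers as illustrative assumptions rather than the authors' configuration:

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Conv stem replacing ViT's linear patchify: four stride-2 convolutions,
    each followed by BatchNorm and Hardswish, turning a 3x224x224 image into a
    14x14 grid of 256-dim tokens for the transformer stages (sizes are illustrative)."""
    def __init__(self, channels=(3, 32, 64, 128, 256)):
        super().__init__()
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(c_out),   # batch-norm after every convolution
                nn.Hardswish(),          # Hardswish activation
            ]
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)                     # (B, 256, 14, 14) for a 224x224 input
        return x.flatten(2).transpose(1, 2)  # (B, 196, 256) token sequence

tokens = ConvStem()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 256])
```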
LeViT is trained with distillation from a teacher network. One head performs classification with a cross-entropy loss against the ground-truth labels; the second head is trained on the predictions of a RegNetY-16GF teacher pretrained on ImageNet.
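A rough sketch of how such a two-head objective can be wired up, assuming DeiT-style hard-label distillation (the teacher's argmax is used as the target for the second head); the equal weighting and the function name are my own illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def two_head_loss(cls_logits, dist_logits, labels, teacher_logits):
    """Classification head trained on ground-truth labels; distillation head
    trained on the frozen teacher's hard predictions; averaged equally."""
    loss_cls = F.cross_entropy(cls_logits, labels)
    teacher_labels = teacher_logits.argmax(dim=1)  # hard labels from the teacher (e.g. RegNetY-16GF)
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)
    return 0.5 * (loss_cls + loss_dist)

# Usage sketch: the student returns two heads, the teacher runs without gradients.
# cls_logits, dist_logits = student(images)
# with torch.no_grad():
#     teacher_logits = teacher(images)
# loss = two_head_loss(cls_logits, dist_logits, labels, teacher_logits)
```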
Ablations: To understand where LeViT's performance comes from, the authors performed multiple ablation studies, selectively changing different parts of the network:
- Without the pyramidal (shrinking) attention structure: loss of accuracy
- Without the convolutional patch embedding (PatchConv): loss of accuracy
- Without batch-norm: training slows down
- Without the teacher model (no distillation): training slows down
- Without the Hardswish non-linearity: performance degrades
TL;DR
- Convolutional operations applied before the transformer stack
- Self-attention progressively shrunk for speed and accuracy
- Results in much faster image classification