author: biofizzatreya
score: 8 / 10
The paper combines convolutional networks, and the training practices that work well for CNNs, with vision transformers. While transformers generalize well over large datasets, they have certain problems: plain self-attention discards positional information (it is permutation-invariant), and since most images are locally similar, vision transformers need more data to train. Moreover, due to the quadratic complexity of the self-attention matrix, transformers have difficulty dealing with large images. LeViT's solution to these problems is to pass the image through several convolution layers first and feed the resulting feature map to the transformer, instead of feeding the raw image directly.
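To make the quadratic-cost point concrete, here is a small back-of-the-envelope calculation (my own illustration, not from the paper): with a ViT-style 16x16 patchification, the number of tokens, and hence the size of each attention matrix, grows quadratically with image resolution.

```python
# Illustration only: how the self-attention matrix grows with image size,
# assuming 16x16 patches (numbers are not from the paper).
for side in (224, 448, 896):
    tokens = (side // 16) ** 2   # number of patch tokens
    attn_entries = tokens ** 2   # entries in one attention matrix
    print(f"{side}x{side} image -> {tokens} tokens -> {attn_entries:,} attention entries")
# 224x224 ->  196 tokens ->    38,416 attention entries
# 448x448 ->  784 tokens ->   614,656 attention entries
# 896x896 -> 3136 tokens -> 9,834,496 attention entries
```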
LeViT design principles (several of these are illustrated in the sketch after this list):
- A small convnet applied to the input of the transformer
- Two output heads
- Hardswish activations
- Batch-norm after every convolution
- Shrinking attention blocks that progressively reduce the resolution of the activation map
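A minimal PyTorch sketch of what such a convolutional patch-embedding stem can look like, combining the conv-stem, batch-norm-after-every-conv, and Hardswish choices. The channel progression and the final 14x14 token grid follow the LeViT style, but treat the exact numbers as illustrative assumptions rather than the authors' configuration:

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Conv stem replacing ViT's linear patchify: four stride-2 convolutions,
    each followed by BatchNorm and Hardswish, turning a 3x224x224 image into a
    14x14 grid of 256-dim tokens for the transformer stages (sizes are illustrative)."""
    def __init__(self, channels=(3, 32, 64, 128, 256)):
        super().__init__()
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(c_out),   # batch-norm after every convolution
                nn.Hardswish(),          # Hardswish activation
            ]
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)                     # (B, 256, 14, 14) for a 224x224 input
        return x.flatten(2).transpose(1, 2)  # (B, 196, 256) token sequence

tokens = ConvStem()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 256])
```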
LeViT is trained with distillation from a teacher network. One head performs classification with a cross-entropy loss against the ground-truth labels; the second head is trained on the predictions of a RegNetY-16GF teacher pretrained on ImageNet.
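A rough sketch of how such a two-head objective can be wired up, assuming DeiT-style hard-label distillation (the teacher's argmax is used as the target for the second head); the equal weighting and the function name are my own illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def two_head_loss(cls_logits, dist_logits, labels, teacher_logits):
    """Classification head trained on ground-truth labels; distillation head
    trained on the frozen teacher's hard predictions; averaged equally."""
    loss_cls = F.cross_entropy(cls_logits, labels)
    teacher_labels = teacher_logits.argmax(dim=1)  # hard labels from the teacher (e.g. RegNetY-16GF)
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)
    return 0.5 * (loss_cls + loss_dist)

# Usage sketch: the student returns two heads, the teacher runs without gradients.
# cls_logits, dist_logits = student(images)
# with torch.no_grad():
#     teacher_logits = teacher(images)
# loss = two_head_loss(cls_logits, dist_logits, labels, teacher_logits)
```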
Ablations: To understand where LeViT's performance comes from, the authors performed multiple ablation studies, selectively changing different parts of the network:
- Without the pyramidal (shrinking) attention structure: loss of accuracy
- Without the convolutional patch embedding (PatchConv): loss of accuracy
- Without batch-norm: training slows down
- Without the teacher model (no distillation): training slows down
- Without the Hardswish non-linearity: performance degrades
TL;DR
- Convolutional operations applied before the transformer stack
- Self-attention progressively shrunk for speed and accuracy
- Results in much faster image classification