author: | zhaoyue-zephyrus |
score: | 9 / 10 |
-
A more efficient video transformer which leverages the fundamental architectural prior of multiscale feature hierarchies.
-
Starting from the input resolution and a small channel dimension, MViT hierarchically expands the channel capacity while reducing the spatial resolution.
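A toy stage schedule makes the idea concrete (the numbers below are illustrative, not the paper's exact configuration): each stage doubles the channel dimension while halving the spatial side length.

```python
# Illustrative multiscale schedule: channels expand as resolution shrinks.
# Starting values (96 channels, 56x56 spatial grid) are assumed, not quoted
# from the paper.
channels = [96]
resolution = [56]
for _ in range(3):
    channels.append(channels[-1] * 2)      # expand channel capacity
    resolution.append(resolution[-1] // 2) # reduce spatial resolution

print(channels)    # per-stage channel dims
print(resolution)  # per-stage spatial side lengths
```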
-
Resolution reduction is achieved through a special attention operator that enables pooling: Multi Head Pooling Attention (MHPA).
-
Linearly project the input \(X\) into intermediate query/key/value tensors:
\[\hat{Q} = X W_Q, \quad \hat{K} = X W_K, \quad \hat{V} = X W_V\] -
Pool \(\hat{Q}, \hat{K}, \hat{V}\) into \(Q, K, V\) with (possibly different) strides.
-
The rest of the attention operation follows the standard scaled dot-product formulation.
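The steps above can be sketched as follows. This is a minimal single-head sketch, not the paper's implementation: it uses 1-D average pooling along the sequence axis as a stand-in for the paper's strided 3-D space-time pooling, and the function names (`pool_seq`, `mhpa`) are my own.

```python
import numpy as np

def pool_seq(x, stride):
    # Average-pool tokens along the sequence axis with the given stride
    # (a 1-D stand-in for the paper's space-time pooling).
    L, d = x.shape
    L_out = L // stride
    return x[: L_out * stride].reshape(L_out, stride, d).mean(axis=1)

def mhpa(X, W_Q, W_K, W_V, q_stride=1, kv_stride=1):
    # 1) Linear projections into intermediate Q_hat, K_hat, V_hat.
    Q_hat, K_hat, V_hat = X @ W_Q, X @ W_K, X @ W_V
    # 2) Pool: the query stride controls the output sequence length;
    #    keys and values share a single stride.
    Q = pool_seq(Q_hat, q_stride)
    K = pool_seq(K_hat, kv_stride)
    V = pool_seq(V_hat, kv_stride)
    # 3) Standard scaled dot-product attention on the pooled tensors.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V
```

With `q_stride=2`, a 16-token input yields an 8-token output; with `q_stride=1`, only key-value pooling is applied and the sequence length is preserved.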
-
Query pooling vs. key-value pooling.
-
Query pooling is only applied at the first MHPA operator of each stage to achieve the resolution decrease; a degenerate stride of \((1, 1, 1)\) is used elsewhere.
-
Key-value pooling does not change the output sequence length, so it is applied wherever query pooling is not. The pooling stride is kept identical for keys and values.
-
The skip connection is also pooled to adapt to the dimension mismatch introduced by query pooling.
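The dimension fix on the residual branch amounts to pooling the input with the same stride before adding it back. A minimal sketch, assuming the 1-D average pooling stand-in and illustrative shapes below:

```python
import numpy as np

def pool_seq(x, stride):
    # Average-pool tokens along the sequence axis (stand-in for the
    # paper's space-time pooling).
    L, d = x.shape
    L_out = L // stride
    return x[: L_out * stride].reshape(L_out, stride, d).mean(axis=1)

# Query pooling with stride 2 shrinks a 16-token sequence to 8 tokens,
# so the skip connection must be pooled with the same stride before adding.
X = np.ones((16, 8))           # block input
attn_out = np.ones((8, 8))     # stand-in for the MHPA output (q_stride = 2)
Y = attn_out + pool_seq(X, 2)  # shapes now match: (8, 8)
```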
-
Comparison with ViT: ViT keeps a constant resolution and channel dimension throughout the network, whereas MViT progressively trades spatial resolution for channel capacity across stages.
-
Experimental results
- Better accuracy-FLOPs tradeoff over previous video transformers.
- SoTA performance on Kinetics, SSv2, and Charades.
- Also applicable to image recognition (\(T = 1\)).
TL;DR
- An efficient video transformer leveraging the fundamental architectural prior of multiscale feature hierarchies.
- A smart pooling attention design for computational efficiency and downsampling.
- Better accuracy-FLOPs tradeoff.