Is Space-Time Attention All You Need for Video Understanding?, Bertasius, Wang, Torresani; 2021 - Summary
| author | score |
| --- | --- |
| zhaoyue-zephyrus | 8 / 10 |
- Naive self-attention is computationally prohibitive due to the large number of patches in a video.
- This paper proposes several scalable self-attention designs over the space-time volume and empirically evaluates them on large-scale video datasets:
- Joint space-time attention
- Divided space-time attention: apply temporal attention and spatial attention separately, one after the other, within each block (see the sketch after this list)
- Sparse Local Global attention:
  - local attention over the neighboring \(T \times H/2 \times W/2\) patches
  - sparse global attention over the entire clip using a stride of 2
- Axial attention: attend over the time, width, and height dimensions separately.
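Below is a minimal sketch of a divided space-time attention block in PyTorch, assuming a pre-norm residual layout and a `(batch, frames, patches, dim)` token tensor; the class token, the projection after temporal attention, and the MLP of the full TimeSformer block are omitted, and the names (e.g. `DividedSpaceTimeAttention`) are illustrative rather than taken from the official code.

```python
import torch
import torch.nn as nn


class DividedSpaceTimeAttention(nn.Module):
    """Sketch of divided space-time attention: temporal attention, then spatial attention."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) with T frames and N patches per frame.
        B, T, N, D = x.shape

        # Temporal attention: each patch location attends over the T frames only.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        yt = self.norm1(xt)
        yt, _ = self.temporal_attn(yt, yt, yt)
        xt = xt + yt                                      # residual
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)    # back to (B, T, N, D)

        # Spatial attention: each frame attends over its N patches only.
        xs = x.reshape(B * T, N, D)
        ys = self.norm2(xs)
        ys, _ = self.spatial_attn(ys, ys, ys)
        xs = xs + ys                                      # residual
        return xs.reshape(B, T, N, D)


# Usage: 2 clips, 8 frames, 14 * 14 = 196 patches, 768-dim tokens.
tokens = torch.randn(2, 8, 196, 768)
out = DividedSpaceTimeAttention()(tokens)
print(out.shape)  # torch.Size([2, 8, 196, 768])
```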
- Comparison of the computational cost of joint space-time attention vs. divided space-time attention; a rough token-count estimate is worked out below.
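As a rough sanity check, assume the default 8-frame, \(224 \times 224\) input with \(16 \times 16\) patches, i.e. \(T = 8\) frames and \(N = 14 \times 14 = 196\) patch tokens per frame. Joint space-time attention lets each of the \(TN\) tokens attend over all \(TN\) tokens, while divided space-time attention lets each token attend over \(T\) tokens temporally and \(N\) tokens spatially:

\[
\underbrace{(TN)^2}_{\text{joint}} = 1568^2 \approx 2.46\,\text{M pairs},
\qquad
\underbrace{TN\,(T + N)}_{\text{divided}} = 1568 \times 204 \approx 0.32\,\text{M pairs},
\]

i.e. roughly \(7.7\times\) fewer attention pairs per head per layer (ignoring the class token).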
- Experimental results:
  - ImageNet-21k pretraining is beneficial for Kinetics-400 (but not for Something-Something V2).
  - TimeSformer scales to larger spatial resolutions and longer videos (more input frames).
  - TimeSformer is more efficient in training and inference cost (note: the comparison is not entirely fair, though).
  - TimeSformer is effective for longer-term video classification.
TL;DR
- The first fully transformer-based architecture for video recognition.
- Nice comparison across several design choices for computing space-time attention more efficiently.