Is Space-Time Attention All You Need for Video Understanding?, Bertasius, Wang, Torresani; 2021 - Summary
| author | score |
| --- | --- |
| zhaoyue-zephyrus | 8 / 10 |
- Naive self-attention is computationally prohibitive due to the large number of patches in a video.
- This paper proposes several scalable self-attention designs over the space-time volume and empirically evaluates them on large-scale video datasets:
- Joint space-time attention
- Divided space-time attention: apply temporal attention and spatial attention separately, one after the other, within each block (see the sketch after this list)
- Sparse Local Global attention:
  - local attention over the neighboring \(T \times H/2 \times W/2\) patches
  - sparse global attention over the entire clip using a stride of 2
- Axial attention: attend over the time, width, and height dimensions separately.
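Below is a minimal sketch of a divided space-time attention block in PyTorch, assuming a pre-norm residual layout and a `(batch, frames, patches, dim)` token tensor; the class token, the projection after temporal attention, and the MLP of the full TimeSformer block are omitted, and the names (e.g. `DividedSpaceTimeAttention`) are illustrative rather than taken from the official code.

```python
import torch
import torch.nn as nn


class DividedSpaceTimeAttention(nn.Module):
    """Sketch of divided space-time attention: temporal attention, then spatial attention."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) with T frames and N patches per frame.
        B, T, N, D = x.shape

        # Temporal attention: each patch location attends over the T frames only.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        yt = self.norm1(xt)
        yt, _ = self.temporal_attn(yt, yt, yt)
        xt = xt + yt                                      # residual
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)    # back to (B, T, N, D)

        # Spatial attention: each frame attends over its N patches only.
        xs = x.reshape(B * T, N, D)
        ys = self.norm2(xs)
        ys, _ = self.spatial_attn(ys, ys, ys)
        xs = xs + ys                                      # residual
        return xs.reshape(B, T, N, D)


# Usage: 2 clips, 8 frames, 14 * 14 = 196 patches, 768-dim tokens.
tokens = torch.randn(2, 8, 196, 768)
out = DividedSpaceTimeAttention()(tokens)
print(out.shape)  # torch.Size([2, 8, 196, 768])
```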
- Comparison of the computational cost of joint space-time attention vs. divided space-time attention; a rough token-count estimate is worked out below.
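As a rough sanity check, assume the default 8-frame, \(224 \times 224\) input with \(16 \times 16\) patches, i.e. \(T = 8\) frames and \(N = 14 \times 14 = 196\) patch tokens per frame. Joint space-time attention lets each of the \(TN\) tokens attend over all \(TN\) tokens, while divided space-time attention lets each token attend over \(T\) tokens temporally and \(N\) tokens spatially:

\[
\underbrace{(TN)^2}_{\text{joint}} = 1568^2 \approx 2.46\,\text{M pairs},
\qquad
\underbrace{TN\,(T + N)}_{\text{divided}} = 1568 \times 204 \approx 0.32\,\text{M pairs},
\]

i.e. roughly \(7.7\times\) fewer attention pairs per head per layer (ignoring the class token).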
- Experimental results:
  - ImageNet-21k pretraining is beneficial for Kinetics-400 (but not for Something-Something V2).
  - TimeSformer scales to larger spatial resolutions and longer videos (more input frames).
  - TimeSformer is more efficient in training and inference cost (note: the comparison is not entirely fair, though).
  - TimeSformer is effective for longer-term video classification.
TL;DR
- The first fully transformer-based architecture for video recognition.
- Nice comparison across several design choices for computing space-time attention more efficiently.