author: | zhaoyue-zephyrus |
score: | 9 / 10 |
-
A more efficient video transformer which leverages the fundamental architectural prior of multiscale feature hierarchies.
-
Starting from the input resolution and a small channel dimension, MViT hierarchically expands the channel capacity while reducing the spatial resolution.
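A toy stage schedule makes the idea concrete (the numbers below are illustrative, not the paper's exact configuration): each stage doubles the channel dimension while halving the spatial side length.

```python
# Illustrative multiscale schedule: channels expand as resolution shrinks.
# Starting values (96 channels, 56x56 spatial grid) are assumed, not quoted
# from the paper.
channels = [96]
resolution = [56]
for _ in range(3):
    channels.append(channels[-1] * 2)      # expand channel capacity
    resolution.append(resolution[-1] // 2) # reduce spatial resolution

print(channels)    # per-stage channel dims
print(resolution)  # per-stage spatial side lengths
```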
-
Resolution reduction is achieved through a special attention operator that enables pooling: Multi Head Pooling Attention (MHPA).
-
Linearly project the input \(X\) into intermediate query/key/value tensors:
\[\hat{Q} = X W_Q, \quad \hat{K} = X W_K, \quad \hat{V} = X W_V\] -
Pool \(\hat{Q}, \hat{K}, \hat{V}\) into \(Q, K, V\) with (possibly different) strides.
-
The rest of the attention operation follows the standard scaled dot-product formulation.
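The steps above can be sketched as follows. This is a minimal single-head sketch, not the paper's implementation: it uses 1-D average pooling along the sequence axis as a stand-in for the paper's strided 3-D space-time pooling, and the function names (`pool_seq`, `mhpa`) are my own.

```python
import numpy as np

def pool_seq(x, stride):
    # Average-pool tokens along the sequence axis with the given stride
    # (a 1-D stand-in for the paper's space-time pooling).
    L, d = x.shape
    L_out = L // stride
    return x[: L_out * stride].reshape(L_out, stride, d).mean(axis=1)

def mhpa(X, W_Q, W_K, W_V, q_stride=1, kv_stride=1):
    # 1) Linear projections into intermediate Q_hat, K_hat, V_hat.
    Q_hat, K_hat, V_hat = X @ W_Q, X @ W_K, X @ W_V
    # 2) Pool: the query stride controls the output sequence length;
    #    keys and values share a single stride.
    Q = pool_seq(Q_hat, q_stride)
    K = pool_seq(K_hat, kv_stride)
    V = pool_seq(V_hat, kv_stride)
    # 3) Standard scaled dot-product attention on the pooled tensors.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V
```

With `q_stride=2`, a 16-token input yields an 8-token output; with `q_stride=1`, only key-value pooling is applied and the sequence length is preserved.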
-
Query pooling vs. key-value pooling.
-
Query pooling is only applied at the first MHPA operator of each stage to achieve the resolution decrease; a degenerate stride of \((1, 1, 1)\) is used elsewhere.
-
Key-value pooling does not change the output sequence length, so it is applied wherever query pooling is not. The pooling stride is kept identical for keys and values.
-
The skip connection is also pooled to adapt to the dimension mismatch introduced by query pooling.
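The dimension fix on the residual branch amounts to pooling the input with the same stride before adding it back. A minimal sketch, assuming the 1-D average pooling stand-in and illustrative shapes below:

```python
import numpy as np

def pool_seq(x, stride):
    # Average-pool tokens along the sequence axis (stand-in for the
    # paper's space-time pooling).
    L, d = x.shape
    L_out = L // stride
    return x[: L_out * stride].reshape(L_out, stride, d).mean(axis=1)

# Query pooling with stride 2 shrinks a 16-token sequence to 8 tokens,
# so the skip connection must be pooled with the same stride before adding.
X = np.ones((16, 8))           # block input
attn_out = np.ones((8, 8))     # stand-in for the MHPA output (q_stride = 2)
Y = attn_out + pool_seq(X, 2)  # shapes now match: (8, 8)
```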
-
Comparison with ViT: ViT keeps a constant resolution and channel dimension throughout the network, whereas MViT progressively trades spatial resolution for channel capacity across stages.
-
Experimental results
- Better accuracy-FLOPs tradeoff over previous video transformers.
- SoTA performance on Kinetics, SSv2, and Charades.
- Also applicable to image recognition (\(T = 1\)).
TL;DR
- An efficient video transformer leveraging the fundamental architectural prior of multiscale feature hierarchies.
- A smart pooling attention design for computational efficiency and downsampling.
- Better accuracy-FLOPs tradeoff.