author: | zhaoyue-zephyrus |
score: | 9 / 10 |
-
A two-pathway model
-
Similar to two-stream models, but both pathways take RGB frames as input
-
Slow pathway: operates at low frame rate to capture spatial semantics
-
Low frame rate: achieved by a large temporal stride \(\tau\)
-
Degenerate temporal convolution (\(1 \times K \times K\)) at the early stages (conv1, res2, res3);
-
No temporal pooling.
-
Fast pathway: operates at a high frame rate (\(\alpha\) times that of the Slow pathway) to capture motion at fine temporal resolution
- Non-degenerate temporal convolution at all stages.
-
The Fast pathway is lightweight: its channel capacity is reduced to \(\beta = \frac{1}{8}\) of the Slow pathway's, so it accounts for only about 20% of the total FLOPs
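
To make the two frame rates concrete, here is a minimal sketch of how one raw clip could be subsampled for each pathway. The function name `pathway_frame_indices` and the example values (a 64-frame clip, \(\tau = 16\), \(\alpha = 8\)) are illustrative assumptions, not code from the paper; the sketch also assumes \(\tau\) is divisible by \(\alpha\).

```python
def pathway_frame_indices(clip_len, tau, alpha):
    """Return (slow, fast) frame indices for a raw clip of `clip_len` frames.

    Assumption: the Slow pathway uses temporal stride tau and the Fast
    pathway uses stride tau / alpha, so it sees alpha times more frames.
    """
    if tau % alpha != 0:
        raise ValueError("this sketch assumes tau is divisible by alpha")
    slow = list(range(0, clip_len, tau))           # T frames
    fast = list(range(0, clip_len, tau // alpha))  # alpha * T frames
    return slow, fast


# Hypothetical example values, chosen only for illustration.
slow, fast = pathway_frame_indices(clip_len=64, tau=16, alpha=8)
print(len(slow), len(fast))
```

With these values the Slow pathway sees \(T = 4\) frames and the Fast pathway sees \(\alpha T = 32\) frames, and every Slow frame is also a Fast frame.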
-
Lateral connections from Fast to Slow pathway.
-
Time-to-channel: \((\alpha T, S^2, \beta C) \rightarrow (T, S^2, \alpha\beta C)\);
-
Time-strided sampling: \((\alpha T, S^2, \beta C) \rightarrow (T, S^2, \beta C)\);
-
Time-strided convolution: \(5\times1\times1\) conv with stride of \(\alpha\).
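
The three lateral-connection variants above can be sketched as shape transformations in NumPy. This is an illustration under stated assumptions, not the authors' implementation: features are stored channels-last as `(time, S, S, channels)`, the function names are made up for this note, the convolution uses random weights with zero padding of 2 frames on each side, and the choice of \(2\beta C\) output channels for the convolution is an assumption of this example.

```python
import numpy as np


def time_to_channel(x, alpha):
    """(alpha*T, S, S, bC) -> (T, S, S, alpha*bC): pack alpha frames into channels."""
    aT, s1, s2, bc = x.shape
    t = aT // alpha
    x = x.reshape(t, alpha, s1, s2, bc)
    x = x.transpose(0, 2, 3, 1, 4)  # move the alpha axis next to channels
    return x.reshape(t, s1, s2, alpha * bc)


def time_strided_sampling(x, alpha):
    """(alpha*T, S, S, bC) -> (T, S, S, bC): keep one frame out of every alpha."""
    return x[::alpha]


def time_strided_conv(x, w, alpha):
    """Temporal conv with kernel w of shape (k, bC, C_out) and stride alpha.

    The kernel is k x 1 x 1: it mixes time and channels, not space.
    """
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0), (0, 0)))  # zero-pad time only
    n_out = (xp.shape[0] - k) // alpha + 1
    frames = [
        np.einsum("khwc,kco->hwo", xp[t * alpha : t * alpha + k], w)
        for t in range(n_out)
    ]
    return np.stack(frames)


# Hypothetical sizes: T = 4, alpha = 8, S = 7, beta * C = 8.
rng = np.random.default_rng(0)
fast = rng.standard_normal((32, 7, 7, 8))
w = rng.standard_normal((5, 8, 16))  # 5x1x1 kernel, 2 * beta * C outputs (assumed)

ttc = time_to_channel(fast, alpha=8)
tss = time_strided_sampling(fast, alpha=8)
tsc = time_strided_conv(fast, w, alpha=8)
print(ttc.shape, tss.shape, tsc.shape)
```

All three outputs have temporal length \(T\), so they can be fused (summed or concatenated) with the Slow pathway features at the same stage.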
-
Experimental results:
-
A better training recipe makes training from scratch comparable to ImageNet pre-training.
-
Convolutional, multi-stage fusion works better than late fusion by averaging.
- A lightweight but temporally high-resolution design is important for the Fast pathway.
- Good results on spatiotemporal detection (AVA) and long-term video classification (Charades).
-
TL;DR
- A slow-fast two-pathway architecture for video recognition.
- Excellent transfer results.