Summary | cs395T

SlowFast Networks for Video Recognition, Feichtenhofer, Fan, Malik, He; 2018 - Summary

author:	zhaoyue-zephyrus
score:	9 / 10

A Two-pathway model
- Similar to two-stream models, but both pathways take RGB frames as input
- Slow pathway: operates at low frame rate to capture spatial semantics
  - low frame rate: achieved by a large stride \(\tau\)
  - Degenerate temporal convolution (\(1 \times K \times K\)) at the early stages (conv1, res2, res3);
  - No temporal pooling.
- Fast pathway: operates at a high frame rate to capture motion at fine temporal resolution
  - Non-degenerate temporal convolution at all stages.
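- The Fast pathway is lightweight: its channel capacity is reduced to \(\beta = \frac{1}{8}\) of the Slow pathway's, and it accounts for about 20% of the total FLOPs
- Lateral connections from the Fast to the Slow pathway. The Fast pathway has \(\alpha\) times as many frames and \(\beta\) times as many channels as the Slow pathway, so its features are reshaped in one of three ways: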
  - Time-to-channel: \((\alpha T, S^2, \beta C) \rightarrow (T, S^2, \alpha\beta C)\);
  - Time-strided sampling: \((\alpha T, S^2, \beta C) \rightarrow (T, S^2, \beta C)\);
  - Time-strided convolution: \(5\times1\times1\) conv with stride of \(\alpha\).
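
The frame sampling and the three lateral-connection variants above can be traced at the level of tensor shapes. The following is a minimal sketch, not the authors' implementation: the function names, the `(frames, height, width, channels)` layout, the temporal padding of `kernel_t // 2`, and the example values (64 raw frames, \(\tau = 16\), \(\alpha = 8\), 8 Fast channels, 16 output channels) are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Assumed layout for this sketch: (frames, height, width, channels).
Shape = Tuple[int, int, int, int]


def sample_indices(num_raw_frames: int, tau: int, alpha: int) -> Tuple[List[int], List[int]]:
    """Frame indices for the Slow (stride tau) and Fast (stride tau / alpha) pathways."""
    if tau % alpha != 0:
        raise ValueError("this sketch assumes alpha divides tau")
    slow = list(range(0, num_raw_frames, tau))
    fast = list(range(0, num_raw_frames, tau // alpha))
    return slow, fast


def time_to_channel(fast: Shape, alpha: int) -> Shape:
    """Pack every alpha consecutive frames into the channel axis."""
    t, h, w, c = fast
    if t % alpha != 0:
        raise ValueError("frame count must be divisible by alpha")
    return (t // alpha, h, w, alpha * c)


def time_strided_sampling(fast: Shape, alpha: int) -> Shape:
    """Keep one out of every alpha frames; channels are unchanged."""
    t, h, w, c = fast
    return (len(range(0, t, alpha)), h, w, c)


def time_strided_conv(fast: Shape, alpha: int, out_channels: int, kernel_t: int = 5) -> Shape:
    """Output shape of a kernel_t x 1 x 1 convolution with temporal stride alpha.

    Assumes symmetric temporal padding of kernel_t // 2 (hypothetical choice).
    """
    t, h, w, _ = fast
    pad = kernel_t // 2
    t_out = (t + 2 * pad - kernel_t) // alpha + 1
    return (t_out, h, w, out_channels)


if __name__ == "__main__":
    slow_idx, fast_idx = sample_indices(num_raw_frames=64, tau=16, alpha=8)
    print(len(slow_idx), len(fast_idx))

    fast_feat: Shape = (len(fast_idx), 56, 56, 8)  # 8 stands in for beta * C
    print(time_to_channel(fast_feat, alpha=8))
    print(time_strided_sampling(fast_feat, alpha=8))
    print(time_strided_conv(fast_feat, alpha=8, out_channels=16))
```

In all three cases the output has the Slow pathway's frame count \(T\), so it can be fused with the Slow features at the same stage; only the resulting channel count differs between variants.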

Experimental results:
- A better training recipe enables training from scratch to be comparable to ImageNet-pretraining.
- Convolutional, multi-stage fusion is better than late fusion by averaging.
- A lightweight but temporally high-resolution design is important for the Fast pathway.
- Good results on spatio-temporal action detection (AVA) and long-term video classification (Charades).

TL;DR

A slow-fast two-pathway architecture for video recognition.
Excellent transfer results.