author: | zhaoyue-zephyrus |
score: | 10 / 10 |
-
Two-stream network inspired by the two-streams hypothesis of the human visual cortex.
-
Spatial stream: ventral stream (which performs object recognition);
-
Temporal stream: dorsal stream (which recognizes motion);
-
Both streams are implemented by a ConvNet (ImageNet-pretrained) and combined by late fusion.
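A minimal sketch of late fusion: per-class scores from the two streams are combined by weighted averaging (the paper also explores a linear SVM on stacked softmax scores; the weights below are illustrative, not the paper's).

```python
import numpy as np

def late_fusion(spatial_scores, temporal_scores, w_spatial=0.5):
    """Fuse per-class softmax scores from the two streams by weighted averaging."""
    return w_spatial * spatial_scores + (1.0 - w_spatial) * temporal_scores

spatial = np.array([0.7, 0.2, 0.1])   # hypothetical softmax output, spatial stream
temporal = np.array([0.1, 0.6, 0.3])  # hypothetical softmax output, temporal stream
fused = late_fusion(spatial, temporal)
prediction = int(np.argmax(fused))
```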
-
-
Optical Flow ConvNets
- Given a pair of consecutive video frames, optical flow is a dense vector field whose value at each pixel is that pixel's displacement between the two frames.
-
Stacking: a single optical-flow frame is noisy, so \(L\) consecutive flow frames are stacked.
-
Concatenate along the channel dimension: \(2 \rightarrow 2\times L\) channels.
-
Optical flow stacking: flow sampled at the same pixel locations across several frames.
-
Trajectory stacking: flow sampled along the motion trajectories.
-
-
Bi-directional: backward flow over \([ \tau - L/2 , \tau ]\) + forward flow over \([ \tau, \tau + L/2 ]\).
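The channel stacking above can be sketched as follows (toy dimensions; each flow frame is an \(H \times W \times 2\) array of \((u, v)\) displacements, and \(L\) frames are concatenated into a \(2L\)-channel input):

```python
import numpy as np

H, W, L = 4, 4, 10  # toy spatial size; L = 10 matches the paper's default stack length
rng = np.random.default_rng(0)
flows = [rng.random((H, W, 2)).astype(np.float32) for _ in range(L)]

# Optical-flow stacking: concatenate the (u, v) channels of L frames -> 2L channels.
stacked = np.concatenate(flows, axis=-1)

# Bi-directional variant: L/2 backward flows + L/2 forward flows around frame tau,
# still 2L channels in total.
backward, forward = flows[: L // 2], flows[L // 2:]
bidirectional = np.concatenate(backward + forward, axis=-1)
```

Trajectory stacking differs only in *where* each frame's flow is sampled (following the motion path rather than a fixed pixel grid); the channel layout is the same.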
-
Implementation:
-
Pre-computed with OpenCV (TV-L1, Farneback): slow (< 10 FPS).
-
Quantization: rescale flow values to [0, 255] and compress with JPEG (> 50× compression ratio).
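A sketch of the quantization step, assuming flow values are first clipped to a fixed bound (a bound of 20 px is common in follow-up code bases; the exact value is an assumption here). The mapping is lossy due to rounding and the subsequent JPEG compression.

```python
import numpy as np

def quantize_flow(flow, bound=20.0):
    """Linearly rescale flow from [-bound, bound] to [0, 255] for JPEG storage."""
    clipped = np.clip(flow, -bound, bound)
    return np.round((clipped + bound) * 255.0 / (2.0 * bound)).astype(np.uint8)

def dequantize_flow(q, bound=20.0):
    """Invert the mapping at training time (lossy: rounding + JPEG)."""
    return q.astype(np.float32) * (2.0 * bound) / 255.0 - bound
```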
-
-
Architectures: the ConvNet's input channels are changed from 3 to \(2L\); the pretrained first-layer weights are averaged across the RGB channels to initialize the new input layer.
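The first-layer adaptation can be sketched as: average the pretrained filters over their 3 RGB input channels, then replicate the result \(2L\) times (the function name is illustrative, not from the paper).

```python
import numpy as np

def adapt_first_conv(w_rgb, L):
    """Adapt pretrained first-layer weights (out_ch, 3, kH, kW) to a 2L-channel
    flow input: average over the 3 RGB channels, then replicate 2L times."""
    mean_w = w_rgb.mean(axis=1, keepdims=True)  # (out_ch, 1, kH, kW)
    return np.repeat(mean_w, 2 * L, axis=1)     # (out_ch, 2L, kH, kW)
```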
-
Multi-task training: combats overfitting, because both UCF-101 and HMDB-51 are small-scale (13K / 6.8K videos).
- Two classification heads (one per dataset) are trained simultaneously with cross-entropy loss.
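A minimal sketch of the multi-task objective: a shared feature feeds two dataset-specific linear heads, and their cross-entropy losses are summed (the linear heads and feature size here are hypothetical stand-ins for the shared ConvNet trunk).

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example."""
    z = logits - logits.max()                     # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

rng = np.random.default_rng(0)
feat = rng.standard_normal(256)                   # hypothetical shared feature
w_ucf = rng.standard_normal((101, 256))           # UCF-101 head (101 classes)
w_hmdb = rng.standard_normal((51, 256))           # HMDB-51 head (51 classes)

# Both datasets' losses are optimized jointly through the shared trunk.
loss = cross_entropy(w_ucf @ feat, 3) + cross_entropy(w_hmdb @ feat, 7)
```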
-
Results:
-
Pre-training is effective:
-
Stacking is important, but the stacking methods do not make much difference:
- Multi-task training helps on the smaller dataset (HMDB-51):
- SoTA at that time; first time deep models beat hand-crafted features on video classification.
-
TL;DR
- First deep networks for videos to beat hand-crafted features.
- A novel two-stream architecture.
- Detailed studies on the architectural designs.