author: | zhaoyue-zephyrus |
score: | 10 / 10 |
-
Two-stream network inspired by the two-streams hypothesis of the human visual cortex.
-
Spatial stream: ventral stream (which performs object recognition);
-
Temporal stream: dorsal stream (which recognizes motion);
-
Both streams are implemented by a ConvNet (ImageNet-pretrained) and combined by late fusion.
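A minimal sketch of late fusion: per-class scores from the two streams are combined by weighted averaging (the paper also explores a linear SVM on stacked softmax scores; the weights below are illustrative, not the paper's).

```python
import numpy as np

def late_fusion(spatial_scores, temporal_scores, w_spatial=0.5):
    """Fuse per-class softmax scores from the two streams by weighted averaging."""
    return w_spatial * spatial_scores + (1.0 - w_spatial) * temporal_scores

spatial = np.array([0.7, 0.2, 0.1])   # hypothetical softmax output, spatial stream
temporal = np.array([0.1, 0.6, 0.3])  # hypothetical softmax output, temporal stream
fused = late_fusion(spatial, temporal)
prediction = int(np.argmax(fused))
```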
-
-
Optical Flow ConvNets
- Given a pair of consecutive video frames, optical flow is a dense vector field whose value at each pixel is that pixel's displacement between the two frames.
-
Stacking: a single optical-flow frame is noisy, so \(L\) consecutive flow frames are stacked.
-
Concatenate along the channel dimension: \(2 \rightarrow 2\times L\) channels.
-
Optical flow stacking: flow sampled at the same pixel locations across several frames.
-
Trajectory stacking: flow sampled along the motion trajectories.
-
-
Bi-directional: backward flow over \([ \tau - L/2 , \tau ]\) + forward flow over \([ \tau, \tau + L/2 ]\).
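The channel stacking above can be sketched as follows (toy dimensions; each flow frame is an \(H \times W \times 2\) array of \((u, v)\) displacements, and \(L\) frames are concatenated into a \(2L\)-channel input):

```python
import numpy as np

H, W, L = 4, 4, 10  # toy spatial size; L = 10 matches the paper's default stack length
rng = np.random.default_rng(0)
flows = [rng.random((H, W, 2)).astype(np.float32) for _ in range(L)]

# Optical-flow stacking: concatenate the (u, v) channels of L frames -> 2L channels.
stacked = np.concatenate(flows, axis=-1)

# Bi-directional variant: L/2 backward flows + L/2 forward flows around frame tau,
# still 2L channels in total.
backward, forward = flows[: L // 2], flows[L // 2:]
bidirectional = np.concatenate(backward + forward, axis=-1)
```

Trajectory stacking differs only in *where* each frame's flow is sampled (following the motion path rather than a fixed pixel grid); the channel layout is the same.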
-
Implementation:
-
Pre-computed with OpenCV (TV-L1, Farneback): slow (< 10 FPS).
-
Quantization: rescale flow values to [0, 255] and compress with JPEG (> 50× compression ratio).
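A sketch of the quantization step, assuming flow values are first clipped to a fixed bound (a bound of 20 px is common in follow-up code bases; the exact value is an assumption here). The mapping is lossy due to rounding and the subsequent JPEG compression.

```python
import numpy as np

def quantize_flow(flow, bound=20.0):
    """Linearly rescale flow from [-bound, bound] to [0, 255] for JPEG storage."""
    clipped = np.clip(flow, -bound, bound)
    return np.round((clipped + bound) * 255.0 / (2.0 * bound)).astype(np.uint8)

def dequantize_flow(q, bound=20.0):
    """Invert the mapping at training time (lossy: rounding + JPEG)."""
    return q.astype(np.float32) * (2.0 * bound) / 255.0 - bound
```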
-
-
Architectures: the ConvNet's input channels are changed from 3 to \(2L\); the pretrained first-layer weights are averaged across the RGB channels to initialize the new input layer.
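The first-layer adaptation can be sketched as: average the pretrained filters over their 3 RGB input channels, then replicate the result \(2L\) times (the function name is illustrative, not from the paper).

```python
import numpy as np

def adapt_first_conv(w_rgb, L):
    """Adapt pretrained first-layer weights (out_ch, 3, kH, kW) to a 2L-channel
    flow input: average over the 3 RGB channels, then replicate 2L times."""
    mean_w = w_rgb.mean(axis=1, keepdims=True)  # (out_ch, 1, kH, kW)
    return np.repeat(mean_w, 2 * L, axis=1)     # (out_ch, 2L, kH, kW)
```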
-
Multi-task training: combats overfitting, because both UCF-101 and HMDB-51 are small-scale (13K / 6.8K videos).
- Two classification heads (one per dataset) are trained simultaneously with cross-entropy loss.
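A minimal sketch of the multi-task objective: a shared feature feeds two dataset-specific linear heads, and their cross-entropy losses are summed (the linear heads and feature size here are hypothetical stand-ins for the shared ConvNet trunk).

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example."""
    z = logits - logits.max()                     # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

rng = np.random.default_rng(0)
feat = rng.standard_normal(256)                   # hypothetical shared feature
w_ucf = rng.standard_normal((101, 256))           # UCF-101 head (101 classes)
w_hmdb = rng.standard_normal((51, 256))           # HMDB-51 head (51 classes)

# Both datasets' losses are optimized jointly through the shared trunk.
loss = cross_entropy(w_ucf @ feat, 3) + cross_entropy(w_hmdb @ feat, 7)
```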
-
Results:
-
Pre-training is effective:
-
Stacking is important, but the stacking methods do not make much difference:
- Multi-task training helps on the smaller dataset (HMDB-51):
- SoTA at that time; first time deep models beat hand-crafted features on video classification.
-
TL;DR
- First deep networks for videos to beat hand-crafted features.
- A novel two-stream architecture.
- Detailed studies on the architectural designs.