author: zhaoyue-zephyrus
score: 10 / 10
- This paper first compares several popular video architectures and proposes a new two-stream Inflated 3D ConvNet (I3D).
- Inflated 3D ConvNet (I3D)
  - Convert 2D ConvNets for images into 3D ConvNets:
    - Inflate all filters and pooling kernels: \(N \times N \rightarrow N \times N \times N\).
    - Bootstrap from pretrained 2D weights: repeat the weights of each 2D filter \(N\) times along the time dimension and rescale by \(1/N\).
  - No temporal pooling in the first two max-pooling layers (use \(1 \times 3 \times 3\) kernels with stride 1 in time).
  - Two-stream design: optical flow is still beneficial, probably because optical flow algorithms are in some sense recurrent.
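The inflation-and-bootstrap step can be sketched in a few lines. Below is a minimal NumPy sketch; the function name and tensor shapes are illustrative, not from the paper. The \(1/N\) rescaling makes a 3D conv over a temporally constant ("boring") video reproduce the original 2D conv's activations.

```python
import numpy as np

def inflate_conv_weight(w2d, kt):
    """Inflate a 2D conv filter bank of shape (out, in, kH, kW) into a
    3D one of shape (out, in, kT, kH, kW): repeat the kernel kT times
    along a new time axis and rescale by 1/kT."""
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], kt, axis=2)
    return w3d / kt

# Example: inflate a 7x7 (e.g. ImageNet-pretrained) filter bank to 7x7x7.
w2d = np.random.randn(64, 3, 7, 7)
w3d = inflate_conv_weight(w2d, 7)
assert w3d.shape == (64, 3, 7, 7, 7)
# Summing the inflated kernel over time recovers the original 2D kernel,
# which is exactly why a "boring" video gives the same response.
assert np.allclose(w3d.sum(axis=2), w2d)
```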
- Kinetics Human Action Video Dataset
  - 400 classes; 400 or more 10-second clips per class.
  - Keeps growing: the newest version has 500K videos over 700 classes (\(700 \times 700\), i.e., 700 classes with 700 or more clips per class).
- Experimental comparisons:
  - ImageNet pretraining still helps on Kinetics.
  - The contribution from flow alone: UCF101 > HMDB51 » Kinetics, likely because Kinetics videos have more camera motion.
  - All architectures benefit from Kinetics pre-training; notably, two-stream I3D and 3D-ConvNet benefit most.
  - SoTA results on UCF101 and HMDB51 (UCF101 is almost saturated).
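The two-stream numbers come from late fusion: the RGB and flow networks are run separately and their per-class predictions are averaged at test time. A minimal sketch under that assumption (the function names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_stream_predict(rgb_logits, flow_logits):
    """Late fusion: average the per-class scores of the RGB and
    optical-flow streams, then take the argmax class per clip."""
    scores = (softmax(rgb_logits) + softmax(flow_logits)) / 2
    return scores.argmax(axis=-1)

# Toy example with 3 classes: the flow stream is more confident
# (class 1) than the RGB stream (class 0), so fusion picks class 1.
rgb = np.array([[2.0, 0.0, 0.0]])
flow = np.array([[0.0, 3.0, 0.0]])
print(two_stream_predict(rgb, flow))  # -> [1]
```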
TL;DR
- A large-scale video dataset.
- A novel Inflated 3D network.
- Excellent transfer results.