author: zhaoyue-zephyrus
score: 10 / 10
- This paper first compares several popular video architectures and proposes a new two-stream Inflated 3D ConvNet (I3D).
- Inflated 3D ConvNet (I3D)
  - Convert 2D ConvNets for images into 3D ConvNets:
    - Inflate all filters and pooling kernels: \(N \times N \rightarrow N \times N \times N\).
    - Bootstrap from pretrained 2D weights: repeat the weights of each 2D filter \(N\) times along the time dimension and rescale by \(1/N\).
  - No temporal pooling in the first two max-pooling layers (use \(1 \times 3 \times 3\) kernels with stride 1 in time).
  - Two-stream design: optical flow is still beneficial, probably because optical flow algorithms are in some sense recurrent.
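The inflation-and-bootstrap step can be sketched in a few lines. Below is a minimal NumPy sketch; the function name and tensor shapes are illustrative, not from the paper. The \(1/N\) rescaling makes a 3D conv over a temporally constant ("boring") video reproduce the original 2D conv's activations.

```python
import numpy as np

def inflate_conv_weight(w2d, kt):
    """Inflate a 2D conv filter bank of shape (out, in, kH, kW) into a
    3D one of shape (out, in, kT, kH, kW): repeat the kernel kT times
    along a new time axis and rescale by 1/kT."""
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], kt, axis=2)
    return w3d / kt

# Example: inflate a 7x7 (e.g. ImageNet-pretrained) filter bank to 7x7x7.
w2d = np.random.randn(64, 3, 7, 7)
w3d = inflate_conv_weight(w2d, 7)
assert w3d.shape == (64, 3, 7, 7, 7)
# Summing the inflated kernel over time recovers the original 2D kernel,
# which is exactly why a "boring" video gives the same response.
assert np.allclose(w3d.sum(axis=2), w2d)
```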
- Kinetics Human Action Video Dataset
  - 400 classes; 400 or more 10-second clips per class.
  - Keeps growing: the newest version has 500K videos over 700 classes (\(700 \times 700\), i.e., 700 classes with 700 or more clips per class).
- Experimental comparisons:
  - ImageNet pretraining still helps on Kinetics.
  - The contribution from flow alone: UCF101 > HMDB51 » Kinetics, likely because Kinetics videos have more camera motion.
  - All architectures benefit from Kinetics pre-training; notably, two-stream I3D and 3D-ConvNet benefit most.
  - SoTA results on UCF101 and HMDB51 (UCF101 is almost saturated).
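The two-stream numbers come from late fusion: the RGB and flow networks are run separately and their per-class predictions are averaged at test time. A minimal sketch under that assumption (the function names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_stream_predict(rgb_logits, flow_logits):
    """Late fusion: average the per-class scores of the RGB and
    optical-flow streams, then take the argmax class per clip."""
    scores = (softmax(rgb_logits) + softmax(flow_logits)) / 2
    return scores.argmax(axis=-1)

# Toy example with 3 classes: the flow stream is more confident
# (class 1) than the RGB stream (class 0), so fusion picks class 1.
rgb = np.array([[2.0, 0.0, 0.0]])
flow = np.array([[0.0, 3.0, 0.0]])
print(two_stream_predict(rgb, flow))  # -> [1]
```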
TL;DR
- A large-scale video dataset.
- A novel Inflated 3D network.
- Excellent transfer results.