Less is More: ClipBERT for Video-and-Language Learning via Sparse
Sampling, Lei, Li, Zhou, Gan, Berg, Bansal, Liu; 2021 - Summary
author: zhaoyue-zephyrus
score: 8 / 10
- Training: Sparsely sample \(N\) clips from the video. Each clip is concatenated with the text input to produce a prediction.
  - Independent predictions from all sampled clips are aggregated to derive a consensus (see the sketch below).
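
A minimal PyTorch sketch of this training scheme. Everything here is an illustrative assumption rather than the authors' released code: the module names (`visual_encoder`, `fusion`, `head`) and the TSN-style one-clip-per-segment sampling.

```python
import random

import torch
import torch.nn as nn


def sample_clip_starts(video_len, n_clips, clip_len, train=True):
    """Assumed TSN-style scheme: split the video into N equal segments and take
    one short clip per segment -- random offset in training, segment center at test."""
    seg = video_len // n_clips
    starts = []
    for i in range(n_clips):
        slack = max(seg - clip_len, 0)
        offset = random.randint(0, slack) if train else slack // 2
        starts.append(i * seg + offset)
    return starts


class SparseSampleModel(nn.Module):
    """Sparse sampling with per-clip prediction and mean-logit consensus."""

    def __init__(self, visual_encoder, fusion, head):
        super().__init__()
        self.visual = visual_encoder  # 2D CNN producing one pooled feature per frame
        self.fusion = fusion          # cross-modal transformer over (clip, text)
        self.head = head              # task-specific classifier

    def forward(self, clips, text_tokens):
        # clips: (B, N, T, C, H, W) -- N sparsely sampled clips of T frames each
        B, N, T = clips.shape[:3]
        frames = clips.flatten(0, 2)                    # (B*N*T, C, H, W)
        feats = self.visual(frames).view(B, N, T, -1)   # per-frame features
        clip_feats = feats.mean(dim=2)                  # temporal mean-pool: (B, N, D)
        # One independent prediction per clip, each fused with the same text input.
        per_clip = [self.head(self.fusion(clip_feats[:, i], text_tokens))
                    for i in range(N)]
        # Consensus: average the per-clip logits.
        return torch.stack(per_clip, dim=1).mean(dim=1)
```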
- Testing: Uniformly sample multiple clips and aggregate their predictions (sketch below).
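
Continuing the sketch above, test-time inference would place clips at uniform (segment-center) positions and reuse the same logit averaging; the `n_clips` and `clip_len` defaults are illustrative, not the paper's settings:

```python
import torch


@torch.no_grad()
def predict(model, video, text_tokens, n_clips=16, clip_len=2):
    # video: (T_total, C, H, W); model is the SparseSampleModel sketched above
    starts = sample_clip_starts(video.shape[0], n_clips, clip_len, train=False)
    clips = torch.stack([video[s:s + clip_len] for s in starts])  # (N, T, C, H, W)
    return model(clips.unsqueeze(0), text_tokens)  # consensus over the N clips
```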
- Image-Text Pre-training: Since a 2D CNN is used as the visual encoder, ClipBERT can use large-scale image-text datasets (COCO Captions + Visual Genome Captions) for pre-training, treating an image as a single-frame clip (see below).
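
Because frames pass through a 2D CNN, an image-text pair fits the same interface as a one-frame clip. A small illustration under the shape conventions assumed in the sketch above:

```python
import torch

image = torch.randn(3, 224, 224)           # e.g. one COCO / Visual Genome image
as_clip = image[None, None, None]          # (B=1, N=1, T=1, C, H, W)
# logits = model(as_clip, caption_tokens)  # same forward path as video input
```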
- Experimental Results:
  - End-to-end training helps.
  - SoTA results on text-to-video retrieval and video question answering.
TL;DR
- Sparse sampling enables end-to-end video-and-language modeling.
- Generalizable to videos with different durations (3-sec to 180-sec).
- Extensive ablation studies.