Less is More: ClipBERT for Video-and-Language Learning via Sparse
Sampling, Lei, Li, Zhou, Gan, Berg, Bansal, Liu; 2021 - Summary
author: zhaoyue-zephyrus
score: 8 / 10
- Training: Sparsely sample \(N\) clips from the video. Each clip is concatenated with the text input to produce a prediction.
  - Independent predictions from all sampled clips are aggregated to derive a consensus (see the sketch below).
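
A minimal PyTorch sketch of this training scheme. Everything here is an illustrative assumption rather than the authors' released code: the module names (`visual_encoder`, `fusion`, `head`) and the TSN-style one-clip-per-segment sampling.

```python
import random

import torch
import torch.nn as nn


def sample_clip_starts(video_len, n_clips, clip_len, train=True):
    """Assumed TSN-style scheme: split the video into N equal segments and take
    one short clip per segment -- random offset in training, segment center at test."""
    seg = video_len // n_clips
    starts = []
    for i in range(n_clips):
        slack = max(seg - clip_len, 0)
        offset = random.randint(0, slack) if train else slack // 2
        starts.append(i * seg + offset)
    return starts


class SparseSampleModel(nn.Module):
    """Sparse sampling with per-clip prediction and mean-logit consensus."""

    def __init__(self, visual_encoder, fusion, head):
        super().__init__()
        self.visual = visual_encoder  # 2D CNN producing one pooled feature per frame
        self.fusion = fusion          # cross-modal transformer over (clip, text)
        self.head = head              # task-specific classifier

    def forward(self, clips, text_tokens):
        # clips: (B, N, T, C, H, W) -- N sparsely sampled clips of T frames each
        B, N, T = clips.shape[:3]
        frames = clips.flatten(0, 2)                    # (B*N*T, C, H, W)
        feats = self.visual(frames).view(B, N, T, -1)   # per-frame features
        clip_feats = feats.mean(dim=2)                  # temporal mean-pool: (B, N, D)
        # One independent prediction per clip, each fused with the same text input.
        per_clip = [self.head(self.fusion(clip_feats[:, i], text_tokens))
                    for i in range(N)]
        # Consensus: average the per-clip logits.
        return torch.stack(per_clip, dim=1).mean(dim=1)
```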
- Testing: Uniformly sample multiple clips and aggregate their predictions (sketch below).
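
Continuing the sketch above, test-time inference would place clips at uniform (segment-center) positions and reuse the same logit averaging; the `n_clips` and `clip_len` defaults are illustrative, not the paper's settings:

```python
import torch


@torch.no_grad()
def predict(model, video, text_tokens, n_clips=16, clip_len=2):
    # video: (T_total, C, H, W); model is the SparseSampleModel sketched above
    starts = sample_clip_starts(video.shape[0], n_clips, clip_len, train=False)
    clips = torch.stack([video[s:s + clip_len] for s in starts])  # (N, T, C, H, W)
    return model(clips.unsqueeze(0), text_tokens)  # consensus over the N clips
```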
- Image-Text Pre-training: Since a 2D CNN is used as the visual encoder, ClipBERT can use large-scale image-text datasets (COCO Captions + Visual Genome Captions) for pre-training, treating an image as a single-frame clip (see below).
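
Because frames pass through a 2D CNN, an image-text pair fits the same interface as a one-frame clip. A small illustration under the shape conventions assumed in the sketch above:

```python
import torch

image = torch.randn(3, 224, 224)           # e.g. one COCO / Visual Genome image
as_clip = image[None, None, None]          # (B=1, N=1, T=1, C, H, W)
# logits = model(as_clip, caption_tokens)  # same forward path as video input
```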
- Experimental Results:
  - End-to-end training helps.
  - SoTA results on text-to-video retrieval and video question answering.
TL;DR
- Sparse sampling enables end-to-end video-and-language modeling.
- Generalizable to videos with different durations (3-sec to 180-sec).
- Extensive ablation studies.