Summary

Long-Term Feature Banks for Detailed Video Understanding, Wu, Feichtenhofer, Fan, He, Krähenbühl, Girshick; 2018

author:	zhaoyue-zephyrus
score:	10 / 10

Enables long-term video understanding by equipping a 3D CNN with a long-term feature bank (LFB).
A long-term feature bank stores long-term contextual information from the whole video:
- computed by running a standard 3D CNN over the entire video at regularly spaced intervals (every second)
- \(L = [L_0, \cdots, L_{T-1}]\), \(L_t \in \mathbb{R}^{N_t\times d}\), where \(N_t\) is the number of RoI features at time step \(t\) and \(d\) is the feature dimension
- fixed component: the feature extractor behind the bank is trained separately on the same target dataset
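The layout of the bank can be sketched in a few lines. This is a minimal, dependency-free illustration and not the authors' code: `extract_roi_features` is a hypothetical stand-in that returns random vectors in place of the separately trained 3D CNN, and the feature dimension `d = 4` and the range of RoI counts are made-up values.

```python
import random

def extract_roi_features(t, d, rng):
    """Hypothetical stand-in for the fixed 3D CNN at second t.

    Returns N_t feature vectors of length d. N_t varies with t because each
    clip can contain a different number of RoIs (possibly none).
    """
    n_rois = rng.randint(0, 3)  # assumed: 0 to 3 RoIs per time step
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_rois)]

def build_feature_bank(num_seconds, d=4, seed=0):
    """Build L = [L_0, ..., L_{T-1}] with one entry per second of video."""
    rng = random.Random(seed)
    return [extract_roi_features(t, d, rng) for t in range(num_seconds)]

bank = build_feature_bank(num_seconds=10)
for t, step in enumerate(bank):
    print(f"L_{t}: {len(step)} RoI feature(s) of dimension 4")
```

Because \(N_t\) differs per time step, the bank is kept as a list of variable-length entries rather than one dense array.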
Feature bank operator: FBO(\(S_t, \hat{L}_t\)) combines the short-term feature \(S_t\) with \(\hat{L}_t\), a subset of the long-term feature bank around time \(t\).
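- Batch window: \(\hat{L}_t = [L_{t-w}, \cdots, L_{t+w}]\), using both past and future context
- Causal window: \(\hat{L}_t = [L_{t-2w}, \cdots, L_{t}]\), using only past and present context
- LFB + Non-local: \(S_t\) attends to \(\hat{L}_t\) through non-local (attention) blocks
- LFB + Pool: pool(\(\hat{L}_t\)), i.e. the window is pooled into a single summary feature
- (baseline) short-term operator: FBO(\(S_t, S_t\)), i.e. the same operator with no long-term input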
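The window choices and operators listed above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, out-of-range indices are clipped at the video boundaries, pooling is taken to be average pooling, and the non-local block is replaced by plain scaled dot-product attention without learned projections.

```python
import math

def select_window(bank, t, w, causal=False):
    """Return the RoI features of L_hat_t, flattened into one list of d-vectors.

    Batch:  [L_{t-w}, ..., L_{t+w}]  (past and future context)
    Causal: [L_{t-2w}, ..., L_t]     (past and present only)
    Indices outside the video are clipped (an assumption of this sketch).
    """
    lo, hi = (t - 2 * w, t) if causal else (t - w, t + w)
    lo, hi = max(lo, 0), min(hi, len(bank) - 1)
    return [roi for step in bank[lo:hi + 1] for roi in step]

def fbo_pool(window):
    """LFB + Pool with average pooling; assumes a non-empty window."""
    d = len(window[0])
    return [sum(roi[k] for roi in window) / len(window) for k in range(d)]

def fbo_attention(short_term, window):
    """Stand-in for LFB + Non-local: each short-term feature attends over
    the window with softmax-normalised scaled dot-product weights."""
    d = len(window[0])
    out = []
    for s in short_term:
        scores = [sum(a * b for a, b in zip(s, roi)) / math.sqrt(d) for roi in window]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(x - top) for x in scores]
        total = sum(weights)
        out.append([sum(wt * roi[k] for wt, roi in zip(weights, window)) / total
                    for k in range(d)])
    return out

# Toy bank: 5 time steps, d = 2, varying number of RoIs per step.
bank = [[[1.0, 0.0]], [[0.0, 1.0], [1.0, 1.0]], [], [[2.0, 2.0]], [[0.0, 0.0]]]
short_term = [[1.0, 1.0]]  # S_t with a single RoI

batch = select_window(bank, t=2, w=1)                # time steps 1..3
causal = select_window(bank, t=2, w=1, causal=True)  # time steps 0..2
pooled = fbo_pool(batch)
attended = fbo_attention(short_term, batch)
baseline = fbo_attention(short_term, short_term)     # FBO(S_t, S_t)
```

With the same `t` and `w`, both windows span \(2w + 1\) time steps; the causal one is simply shifted so that it ends at \(t\). The baseline reuses the attention operator with \(S_t\) in both arguments, so it sees no long-term context.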
Experimental results:
- LFB is much better at capturing long-term patterns than a 3D CNN alone.
- LFB significantly outperforms previous methods.
- The approach is also applicable to EPIC-Kitchens and Charades.

TL;DR