author: zhaoyue-zephyrus
score: 10 / 10
- Enable long-term video understanding by equipping a 3D CNN with a Long-Term Feature Bank (LFB).
- Long-Term Feature Bank: stores long-term contextual information from the whole video.
  - Built by running a standard 3D CNN over the entire video at regularly spaced intervals (every one second).
  - \(L = [L_0, \cdots, L_{T-1}]\), \(L_t \in \mathbb{R}^{N_t\times d}\), where \(N_t\) is the number of RoI features at time step \(t\) and \(d\) is the feature dimension.
  - Fixed components: the models used to build the bank are trained separately on the same target dataset.
- Feature bank operator: FBO(\(S_t, \hat{L}_t\)) takes the short-term feature \(S_t\) and a subset \(\hat{L}_t\) of the long-term feature bank around time \(t\).
  - Batch window (past and future): \(\hat{L}_t = [L_{t-w}, \cdots, L_{t+w}]\)
  - Causal window (past only): \(\hat{L}_t = [L_{t-2w}, \cdots, L_{t}]\)
  - LFB + Non-local (attention)
  - LFB + Pool: pool(\(\hat{L}_t\))
  - (baseline) short-term operator: FBO(\(S_t, S_t\))
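
The window definitions and operator variants above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`select_window`, `fbo_pool`, `fbo_attention`) are hypothetical, window indices are clipped at the video boundaries (the notes do not say how boundaries are handled), and the attention variant is a bare scaled dot-product with a residual connection, leaving out the learned projections of a real non-local block.

```python
import math


def select_window(bank, t, w, causal=False):
    """Return the rows of L_hat_t: every RoI feature from the chosen time steps.

    `bank` is a list of T entries; entry t is a list of N_t feature vectors.
    Batch window: steps t-w .. t+w.  Causal window: steps t-2w .. t.
    Out-of-range steps are clipped to [0, T-1] (an assumption of this sketch).
    """
    lo, hi = (t - 2 * w, t) if causal else (t - w, t + w)
    lo, hi = max(lo, 0), min(hi, len(bank) - 1)
    return [row for step in bank[lo:hi + 1] for row in step]


def fbo_pool(window, mode="avg"):
    """LFB + Pool: reduce L_hat_t to a single d-dimensional vector."""
    if not window:
        raise ValueError("cannot pool an empty window")
    columns = list(zip(*window))  # one tuple per feature dimension
    if mode == "avg":
        return [sum(c) / len(c) for c in columns]
    return [max(c) for c in columns]


def fbo_attention(short_term, window):
    """Simplified attention operator: each row of S_t attends over L_hat_t.

    Softmax-weighted sum of window rows, added back to the query (residual).
    Learned projections and normalisation layers are deliberately omitted.
    """
    if not window:
        return [list(q) for q in short_term]
    d = len(short_term[0])
    out = []
    for q in short_term:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in window]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        context = [sum(wt * k[j] for wt, k in zip(weights, window)) / total
                   for j in range(d)]
        out.append([qi + ci for qi, ci in zip(q, context)])
    return out


# Toy bank: T = 5 time steps, d = 2, with a varying number of RoI features N_t.
bank = [
    [[1.0, 0.0]],
    [[0.0, 1.0], [1.0, 1.0]],
    [],                      # no detections at this time step
    [[2.0, 0.0]],
    [[0.0, 2.0]],
]
S_t = [[1.0, 0.0]]           # one short-term RoI feature

batch_window = select_window(bank, t=2, w=1)                # steps 1..3
causal_window = select_window(bank, t=2, w=1, causal=True)  # steps 0..2
long_term_out = fbo_attention(S_t, batch_window)            # LFB + attention
baseline_out = fbo_attention(S_t, S_t)                      # FBO(S_t, S_t)
```

Storing the bank as a list of per-step lists keeps the ragged \(N_t\) explicit; a tensor implementation would typically pad and mask instead.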
Experimental results:
- The LFB captures long-term patterns much better than a 3D CNN alone.
- The LFB significantly outperforms previous methods.
- Also applicable to EPIC-Kitchens and Charades.
TL;DR
- A memory bank design for long-term video understanding.
- Significant improvement on spatio-temporal action detection.