CLEVRER: CoLlision Events for Video REpresentation and Reasoning, Yi, Gan, Li, Kohli, Wu, Torralba, Tenenbaum; 2019 - Summary
author: | zhaoyue-zephyrus |
score: | 10 / 10 |
- CLEVRER: a diagnostic video dataset for temporal and causal reasoning in a fully controlled environment. Four question types:
  - descriptive: what color?
  - explanatory: what’s responsible for?
  - predictive: what’ll happen next?
  - counterfactual: what if?
- Evaluations on baseline methods:
  - Baselines: language-only models; VQA models; compositional visual reasoning models
  - Baselines perform much better on descriptive questions than on causal ones (explanatory, predictive, counterfactual)
  - All baseline models lack a component that explicitly models the dynamics of the objects and the causal relations between collision events
- Neural-symbolic dynamic reasoning (NS-DR):
  - Video frame parser: produces an object-centric video representation
  - Neural dynamics predictor: a dynamics model that predicts the object dynamics
  - Question parser: parses input questions/choices into a sequence of program tokens
  - Program executor: assembles the modules, iterates through the program tree, and returns the final module's output as the answer
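
The program executor step can be illustrated with a small sketch. This is not the authors' implementation: the scene format, the module names (`objects`, `filter`, `collided_with`, `query`, `count`), and the program encoding are all hypothetical, chosen only to show how a parsed question becomes a sequence of modules whose last output is the answer. The program is written in flattened (postfix) order and evaluated with a stack, which amounts to walking a program tree bottom-up.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical symbolic scene, standing in for the video frame parser's output.
SCENE: Dict[str, Any] = {
    "objects": [
        {"id": 0, "color": "red", "shape": "cube", "material": "metal"},
        {"id": 1, "color": "blue", "shape": "sphere", "material": "rubber"},
        {"id": 2, "color": "red", "shape": "cylinder", "material": "rubber"},
    ],
    # Hypothetical collision events: (frame, object id, object id).
    "collisions": [(40, 0, 1), (75, 1, 2)],
}

Stack = List[Any]


def m_objects(stack: Stack, scene: Dict[str, Any], arg: Any) -> None:
    # Push the list of all objects in the scene.
    stack.append(list(scene["objects"]))


def m_filter(stack: Stack, scene: Dict[str, Any], arg: Tuple[str, str]) -> None:
    # Keep only objects whose attribute `key` equals `value`.
    key, value = arg
    stack.append([o for o in stack.pop() if o[key] == value])


def m_collided_with(stack: Stack, scene: Dict[str, Any], arg: Any) -> None:
    # Expects exactly one object on top of the stack; push its collision partners.
    (target,) = stack.pop()
    partner_ids = {
        b if a == target["id"] else a
        for _, a, b in scene["collisions"]
        if target["id"] in (a, b)
    }
    stack.append([o for o in scene["objects"] if o["id"] in partner_ids])


def m_query(stack: Stack, scene: Dict[str, Any], arg: str) -> None:
    # Expects exactly one object; push the requested attribute value.
    (obj,) = stack.pop()
    stack.append(obj[arg])


def m_count(stack: Stack, scene: Dict[str, Any], arg: Any) -> None:
    # Replace the object list on top of the stack with its length.
    stack.append(len(stack.pop()))


# Dispatch table: program token name -> module (all names are illustrative).
MODULES: Dict[str, Callable[[Stack, Dict[str, Any], Any], None]] = {
    "objects": m_objects,
    "filter": m_filter,
    "collided_with": m_collided_with,
    "query": m_query,
    "count": m_count,
}


def execute(program: List[Tuple[str, Any]], scene: Dict[str, Any]) -> Any:
    """Run each module in order; the final module's output is the answer."""
    stack: Stack = []
    for name, arg in program:
        MODULES[name](stack, scene, arg)
    return stack.pop()


# "How many red objects are there?" (descriptive)
count_red = [("objects", None), ("filter", ("color", "red")), ("count", None)]

# "What shape is the object that collides with the red cube?" (descriptive)
partner_shape = [
    ("objects", None),
    ("filter", ("color", "red")),
    ("filter", ("shape", "cube")),
    ("collided_with", None),
    ("query", "shape"),
]
```

Calling `execute(count_red, SCENE)` and `execute(partner_shape, SCENE)` answers the two toy questions from the symbolic scene alone. In a fuller sketch, predictive and counterfactual questions would run the same kind of program over collision events produced by a dynamics predictor rather than over observed events.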
- TL;DR
  - A synthetic dataset to study temporal and causal reasoning in videos.
  - Causal reasoning is challenging for current models (without extra supervision like question programs).