Summary

author:	zhaoyue-zephyrus
score:	10 / 10

CLEVRER: a diagnostic video dataset for temporal and causal reasoning under a fully controlled environment

Four question types:
- Descriptive: what color?
- Explanatory: what is responsible for an event?
- Predictive: what will happen next?
- Counterfactual: what if?
Evaluations on baseline methods:
- Language-only models; VQA models; compositional visual reasoning models
- Baselines do better on descriptive questions than on causal ones (explanatory, predictive, counterfactual)
- All baseline models lack a component to explicitly model the dynamics of the objects and the causal relations between the collision events
Neuro-symbolic dynamic reasoning (NS-DR):
- Video frame parser: object-centric video representation
- Neural dynamics predictor: a learned model that predicts the objects' dynamics
- Question parser: parses input questions/choices into a sequence of program tokens
- Program executor: assembles the modules and iterates through the program tree; the final module outputs the answer
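
To make the executor step concrete, here is a minimal sketch of running a token sequence over an object-centric scene. It is not the paper's implementation: the scene contents, the module names (`filter_shape`, `collision_partner`, `query_color`, `count`) and the `name:arg` token format are all hypothetical, and the program is simplified to a linear chain rather than a tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    oid: int
    color: str
    shape: str


@dataclass(frozen=True)
class Collision:
    frame: int
    a: int  # object id
    b: int  # object id


# Hypothetical stand-in for what a frame parser and dynamics predictor might produce.
SCENE = {
    "objects": [
        Obj(0, "red", "cube"),
        Obj(1, "blue", "sphere"),
        Obj(2, "green", "cylinder"),
    ],
    "collisions": [Collision(frame=40, a=0, b=1)],
}


def run_program(tokens, scene):
    """Run program tokens in order; each module consumes the previous output."""
    state = None
    for tok in tokens:
        name, _, arg = tok.partition(":")
        if name == "objects":
            state = list(scene["objects"])
        elif name == "filter_shape":
            state = [o for o in state if o.shape == arg]
        elif name == "filter_color":
            state = [o for o in state if o.color == arg]
        elif name == "collision_partner":
            # Objects that collide with any object in the current set.
            ids = {o.oid for o in state}
            partners = set()
            for c in scene["collisions"]:
                if c.a in ids:
                    partners.add(c.b)
                if c.b in ids:
                    partners.add(c.a)
            state = [o for o in scene["objects"] if o.oid in partners]
        elif name == "query_color":
            if len(state) != 1:
                raise ValueError("query_color expects exactly one object")
            state = state[0].color
        elif name == "count":
            state = len(state)
        else:
            raise ValueError(f"unknown module: {tok}")
    return state  # the final module's output is the answer


# "What color is the object that collides with the cube?"
answer = run_program(
    ["objects", "filter_shape:cube", "collision_partner", "query_color"], SCENE
)
```

In this toy scene `answer` is `"blue"`. A counterfactual question could reuse the same executor on a scene whose collision list comes from a dynamics rollout with one object removed.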

TL;DR

A synthetic dataset to study temporal and causal reasoning in videos.
Causal reasoning is challenging for current models (without extra supervision like question programs).