GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields, Niemeyer, Geiger; 2020 - Summary
author: chengchunhsu
score: 9 / 10
What is the core idea?
- Problem: content creation needs to be controllable
- Solution:
- interpret the scene in the 3D domain rather than 2D
- disentangle the scene into multiple “feature fields” (i.e., h_1 ~ h_N)
How is it realized (technically)?
Object as Neural Feature Fields
How to represent multiple objects?
- Use a separate feature field in combination with an affine transformation for each object.
- Transform points from object to scene space by k(x) = R · s · x + t, where:
- s: scale parameter
- t: translation parameter
- R: rotation matrix
- x: 3D points
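A minimal sketch of this object-to-scene transformation, assuming the scale s acts per-axis (function name and example values are illustrative):

```python
import numpy as np

def object_to_scene(x, s, t, R):
    """k(x) = R @ (s * x) + t: map a 3D point from object space to scene space."""
    return R @ (np.asarray(s) * np.asarray(x)) + np.asarray(t)

# identity rotation, uniform scale 2, translation along x
R = np.eye(3)
y = object_to_scene(x=[1.0, 0.0, 0.0],
                    s=[2.0, 2.0, 2.0],
                    t=[1.0, 0.0, 0.0],
                    R=R)
# y == [3., 0., 0.]
```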
Volume rendering
- The generative neural feature field h_θ maps (x, d, z_s, z_a) to a volume density and a feature vector used for scene generation, where:
- σ: volume density
- f: features (for scene generation)
- x: 3D points
- d: view direction
- z_s: shape code
- z_a: appearance code
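A toy sketch of such a feature field: the layer sizes, weight initialization, and positional-encoding length are placeholder assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def positional_encoding(p, L=4):
    """gamma(p): [sin(2^l * pi * p), cos(2^l * pi * p)] for l = 0..L-1."""
    freqs = (2.0 ** np.arange(L)) * np.pi
    angles = np.outer(freqs, p).ravel()
    return np.concatenate([np.sin(angles), np.cos(angles)])

def feature_field(x, d, z_s, z_a, W1, W2_sigma, W2_feat):
    """Toy h_theta: (x, d, z_s, z_a) -> (sigma, f). Weights are placeholders."""
    h = np.tanh(W1 @ np.concatenate([positional_encoding(x), z_s]))
    sigma = np.exp(h @ W2_sigma)  # exp keeps the density non-negative
    f = W2_feat @ np.concatenate([h, positional_encoding(d), z_a])
    return sigma, f

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 32)) * 0.1       # input: gamma(x) (24) + z_s (8)
W2_sigma = rng.normal(size=16) * 0.1
W2_feat = rng.normal(size=(32, 48)) * 0.1  # input: h (16) + gamma(d) (24) + z_a (8)
sigma, f = feature_field(np.zeros(3), np.array([0.0, 0.0, 1.0]),
                         rng.normal(size=8), rng.normal(size=8),
                         W1, W2_sigma, W2_feat)
```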
Scene composition
Combine all feature fields at each point: densities are summed (σ = Σ_i σ_i) and features are averaged with density weights (f = (1/σ) Σ_i σ_i f_i), where:
- σ: volume density
- f: features (for scene generation)
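The composition operator can be sketched as follows (a minimal version; the small epsilon guarding against division by zero is an added assumption):

```python
import numpy as np

def compose(sigmas, feats):
    """Combine N entities at one point: sigma = sum_i sigma_i,
    f = (1 / sigma) * sum_i sigma_i * f_i (density-weighted mean)."""
    sigmas = np.asarray(sigmas)   # (N,): one density per entity
    feats = np.asarray(feats)     # (N, F): one feature vector per entity
    sigma = sigmas.sum()
    f = (sigmas[:, None] * feats).sum(axis=0) / max(sigma, 1e-8)
    return sigma, f

sigma, f = compose([1.0, 3.0], [[1.0, 0.0], [0.0, 1.0]])
# sigma == 4.0, f == [0.25, 0.75]
```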
Training
- Generator objectives:
- N: number of entities (i.e., objects and background)
- N_s: number of sample points along each ray
- d_k: ray for k-th pixel
- x_jk: the j-th sample point for the k-th pixel / ray
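The feature image fed to the discriminator is obtained by numerically integrating the composed field along each ray. A minimal single-ray sketch, assuming the standard alpha-compositing weights α_j = 1 − exp(−σ_j δ_j) and transmittance τ_j = Π_{k<j}(1 − α_k), with δ_j the distance between adjacent samples (names are illustrative):

```python
import numpy as np

def render_ray(sigmas, feats, deltas):
    """Volume rendering along one ray: f = sum_j tau_j * alpha_j * f_j."""
    alphas = 1.0 - np.exp(-np.asarray(sigmas) * np.asarray(deltas))
    # transmittance: product of (1 - alpha) over all earlier samples
    taus = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = taus * alphas                                  # (N_s,)
    return (weights[:, None] * np.asarray(feats)).sum(axis=0)

# a nearly opaque first sample occludes the second one
out = render_ray(sigmas=[100.0, 1.0],
                 feats=[[1.0, 0.0], [0.0, 1.0]],
                 deltas=[1.0, 1.0])
# out is approximately [1., 0.]
```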
- Discriminator objective: binary cross entropy
How well does the paper perform?
**Disentangled Scene Generation**
- To what degree does the model learn to generate disentangled scene representations?
- “disentangled scene”: are objects disentangled from the background?
**Controllable Scene Generation**
- How well can the scene be controlled?
**Generalization Beyond Training Data**
- Do the compositional scene representations allow us to generalize outside the training distribution?
**Comparison to Baselines**
TL;DR
- Compositional 3D scene representation leads to more controllable image synthesis
- The neural rendering pipeline enables faster inference and more realistic images
- The proposed method enables controllable image synthesis for single- and multi-object scenes while training only on raw, unstructured image collections