GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields, Niemeyer, Geiger; 2020 - Summary
author: chengchunhsu
score: 9 / 10
What is the core idea?
- Problem: content creation needs to be controllable
- Solution:
- interpret the scene in the 3D domain rather than 2D
- disentangle the scene into multiple “feature fields” (i.e., h_1 ~ h_N)
How is it realized (technically)?
Object as Neural Feature Fields
How to represent multiple objects?
- Use a separate feature field in combination with an affine transformation for each object.
- Transform points from object to scene space by k(x) = R · s · x + t, where:
- s: scale parameter
- t: translation parameter
- R: rotation matrix
- x: 3D points
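A minimal sketch of this object-to-scene transformation, assuming the scale s acts per-axis (function name and example values are illustrative):

```python
import numpy as np

def object_to_scene(x, s, t, R):
    """k(x) = R @ (s * x) + t: map a 3D point from object space to scene space."""
    return R @ (np.asarray(s) * np.asarray(x)) + np.asarray(t)

# identity rotation, uniform scale 2, translation along x
R = np.eye(3)
y = object_to_scene(x=[1.0, 0.0, 0.0],
                    s=[2.0, 2.0, 2.0],
                    t=[1.0, 0.0, 0.0],
                    R=R)
# y == [3., 0., 0.]
```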
Volume rendering
- The generative neural feature field h_θ maps (x, d, z_s, z_a) to a volume density and a feature vector used for scene generation, where:
- σ: volume density
- f: features (for scene generation)
- x: 3D points
- d: view direction
- z_s: shape code
- z_a: appearance code
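A toy sketch of such a feature field: the layer sizes, weight initialization, and positional-encoding length are placeholder assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def positional_encoding(p, L=4):
    """gamma(p): [sin(2^l * pi * p), cos(2^l * pi * p)] for l = 0..L-1."""
    freqs = (2.0 ** np.arange(L)) * np.pi
    angles = np.outer(freqs, p).ravel()
    return np.concatenate([np.sin(angles), np.cos(angles)])

def feature_field(x, d, z_s, z_a, W1, W2_sigma, W2_feat):
    """Toy h_theta: (x, d, z_s, z_a) -> (sigma, f). Weights are placeholders."""
    h = np.tanh(W1 @ np.concatenate([positional_encoding(x), z_s]))
    sigma = np.exp(h @ W2_sigma)  # exp keeps the density non-negative
    f = W2_feat @ np.concatenate([h, positional_encoding(d), z_a])
    return sigma, f

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 32)) * 0.1       # input: gamma(x) (24) + z_s (8)
W2_sigma = rng.normal(size=16) * 0.1
W2_feat = rng.normal(size=(32, 48)) * 0.1  # input: h (16) + gamma(d) (24) + z_a (8)
sigma, f = feature_field(np.zeros(3), np.array([0.0, 0.0, 1.0]),
                         rng.normal(size=8), rng.normal(size=8),
                         W1, W2_sigma, W2_feat)
```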
Scene composition
Combine all feature fields at each point: densities are summed (σ = Σ_i σ_i) and features are averaged with density weights (f = (1/σ) Σ_i σ_i f_i), where:
- σ: volume density
- f: features (for scene generation)
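The composition operator can be sketched as follows (a minimal version; the small epsilon guarding against division by zero is an added assumption):

```python
import numpy as np

def compose(sigmas, feats):
    """Combine N entities at one point: sigma = sum_i sigma_i,
    f = (1 / sigma) * sum_i sigma_i * f_i (density-weighted mean)."""
    sigmas = np.asarray(sigmas)   # (N,): one density per entity
    feats = np.asarray(feats)     # (N, F): one feature vector per entity
    sigma = sigmas.sum()
    f = (sigmas[:, None] * feats).sum(axis=0) / max(sigma, 1e-8)
    return sigma, f

sigma, f = compose([1.0, 3.0], [[1.0, 0.0], [0.0, 1.0]])
# sigma == 4.0, f == [0.25, 0.75]
```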
Training
- Generator objectives:
- N: number of entities (i.e., objects and background)
- N_s: number of sample points along each ray
- d_k: ray for k-th pixel
- x_jk: the j-th sample point for the k-th pixel / ray
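The feature image fed to the discriminator is obtained by numerically integrating the composed field along each ray. A minimal single-ray sketch, assuming the standard alpha-compositing weights α_j = 1 − exp(−σ_j δ_j) and transmittance τ_j = Π_{k<j}(1 − α_k), with δ_j the distance between adjacent samples (names are illustrative):

```python
import numpy as np

def render_ray(sigmas, feats, deltas):
    """Volume rendering along one ray: f = sum_j tau_j * alpha_j * f_j."""
    alphas = 1.0 - np.exp(-np.asarray(sigmas) * np.asarray(deltas))
    # transmittance: product of (1 - alpha) over all earlier samples
    taus = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = taus * alphas                                  # (N_s,)
    return (weights[:, None] * np.asarray(feats)).sum(axis=0)

# a nearly opaque first sample occludes the second one
out = render_ray(sigmas=[100.0, 1.0],
                 feats=[[1.0, 0.0], [0.0, 1.0]],
                 deltas=[1.0, 1.0])
# out is approximately [1., 0.]
```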
- Discriminator objective: binary cross entropy
How well does the paper perform?
**Disentangled Scene Generation**
- To what degree does the model learn to generate disentangled scene representations?
- “disentangled scene”: are objects disentangled from the background?
**Controllable Scene Generation**
- How well can the scene be controlled?
**Generalization Beyond Training Data**
- Do the compositional scene representations allow us to generalize outside the training distribution?
**Comparison to Baselines**
TL;DR
- Compositional 3D scene representation leads to more controllable image synthesis
- The neural rendering pipeline enables faster inference and more realistic images
- The proposed method enables controllable image synthesis for single- and multi-object scenes while training only on raw, unstructured image collections