author: chengchunhsu
score: 7 / 10
What is the core idea?
Core idea: cast object detection as a direct set prediction problem
Motivation: avoid hand-designed components in modern object detectors
- E.g., NMS and anchor generation
Solution: DEtection TRansformer (DETR)
- a set-based global loss that forces unique predictions via bipartite matching
- a transformer encoder-decoder architecture
How is it realized (technically)?
Key component:
- a set prediction loss that forces unique matching between predicted and ground truth boxes
- an architecture that predicts (in a single pass) a set of objects and models their relation
Set prediction loss
DETR infers a fixed-size set of N predictions (N = 100 in the paper, well above the typical number of objects per image; unmatched slots predict a special "no object" class ∅)
Goal: score predicted objects (class, position, size) with respect to the ground truth
How:
- The loss first produces an optimal bipartite matching between predicted and ground-truth objects
- After that, we optimize the object-specific (class and bounding box) losses
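The matching step above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: real DETR uses an efficient Hungarian solver (e.g., `scipy.optimize.linear_sum_assignment`), and its matching cost also includes a GIoU term; here the cost is a simplified stand-in (negative class probability plus L1 box distance) and the search is brute force.

```python
from itertools import permutations

def match_cost(pred, gt):
    """Simplified pairwise matching cost: negative probability of the
    ground-truth class plus L1 distance between box coordinates.
    (A stand-in for the paper's L_match, which also uses a GIoU term.)"""
    prob, box = pred
    gt_class, gt_box = gt
    l1 = sum(abs(p - g) for p, g in zip(box, gt_box))
    return -prob[gt_class] + l1

def optimal_matching(preds, gts):
    """Return, for each ground-truth object, the index of the prediction
    it is matched to, minimizing the total cost. Brute force over
    permutations (fine for tiny N; DETR uses the O(N^3) Hungarian algorithm)."""
    best, best_perm = float("inf"), None
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p], gts[i]) for i, p in enumerate(perm))
        if cost < best:
            best, best_perm = cost, perm
    return list(best_perm)

# Two predictions: class probabilities over {cat, dog} and a (cx, cy, w, h) box
preds = [([0.9, 0.1], [0.5, 0.5, 0.2, 0.2]),   # confident "cat" near center
         ([0.2, 0.8], [0.1, 0.1, 0.1, 0.1])]   # confident "dog" in corner
gts = [(1, [0.1, 0.1, 0.1, 0.1]),              # dog in corner
       (0, [0.5, 0.5, 0.2, 0.2])]              # cat near center
print(optimal_matching(preds, gts))  # → [1, 0]: dog ↔ pred 1, cat ↔ pred 0
```

Because the matching is one-to-one, two predictions cannot both "claim" the same ground-truth box, which is what removes the need for NMS.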
Match loss: the pairwise cost used for matching,
L_match(y_i, ŷ_σ(i)) = −1[c_i ≠ ∅] · p̂_σ(i)(c_i) + 1[c_i ≠ ∅] · L_box(b_i, b̂_σ(i));
the optimal permutation σ̂ = argmin_σ Σ_i L_match(y_i, ŷ_σ(i)) is computed with the Hungarian algorithm
Hungarian loss: optimized after matching,
L_Hungarian(y, ŷ) = Σ_i [ −log p̂_σ̂(i)(c_i) + 1[c_i ≠ ∅] · L_box(b_i, b̂_σ̂(i)) ]
Bounding box prediction loss: a mix of generalized IoU (scale-invariant) and L1,
L_box(b_i, b̂_σ(i)) = λ_iou · L_iou(b_i, b̂_σ(i)) + λ_L1 · ‖b_i − b̂_σ(i)‖_1
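A minimal sketch of the box loss, combining an L1 term with a generalized-IoU term. Assumptions for illustration: boxes are axis-aligned corner coordinates (x1, y1, x2, y2) rather than the paper's (cx, cy, w, h) parametrization, and the weights mirror the paper's λ_iou = 2, λ_L1 = 5.

```python
def giou(a, b):
    """Generalized IoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # intersection rectangle (clamped to zero if the boxes do not overlap)
    inter = (max(0.0, min(ax2, bx2) - max(ax1, bx1))
             * max(0.0, min(ay2, by2) - max(ay1, by1)))
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # smallest box enclosing both; penalizes non-overlapping, far-apart boxes
    enclose = ((max(ax2, bx2) - min(ax1, bx1))
               * (max(ay2, by2) - min(ay1, by1)))
    return iou - (enclose - union) / enclose

def box_loss(pred, gt, l_iou=2.0, l_l1=5.0):
    """L_box = lambda_iou * (1 - GIoU) + lambda_L1 * ||pred - gt||_1."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return l_iou * (1.0 - giou(pred, gt)) + l_l1 * l1

print(box_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0)))  # perfect match → 0.0
```

The GIoU term matters because an L1 loss alone has different magnitudes for small and large boxes at the same relative overlap.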
Transformer architecture
- CNN backbone
  - Generates a low-resolution feature map (32× downsampling, 2048 channels)
- Encoder-decoder transformer
  - Encoder input: the flattened feature map plus positional encodings
  - Decoder input: N learned object queries; decoder output: N output embeddings
- Feed-forward network (FFN)
  - Independently decodes each output embedding into box coordinates and a class label
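The tensor shapes through this pipeline can be traced with simple arithmetic. The input resolution below is illustrative; hidden size d = 256 and N = 100 queries match the paper's defaults, and the class count uses COCO's 91 category ids plus a "no object" slot.

```python
# Trace tensor shapes through DETR (illustrative input resolution).
H, W = 800, 1066            # input image height and width
C, d, N = 2048, 256, 100    # backbone channels, hidden dim, object queries

# CNN backbone: 32x downsampling -> low-resolution feature map
h, w = H // 32, W // 32
print("backbone output:", (C, h, w))         # (2048, 25, 33)

# A 1x1 conv reduces channels to d, then spatial dims are flattened
# into a sequence; positional encodings are added with the same shape.
seq_len = h * w
print("encoder input:  ", (seq_len, d))      # (825, 256)

# The decoder attends N learned object queries to the encoder memory.
print("decoder output: ", (N, d))            # (100, 256)

# FFN heads decode each output embedding independently.
num_classes = 91                             # COCO category ids
print("predictions:    ", (N, 4), (N, num_classes + 1))  # boxes, class logits
```

Because every one of the 825 encoder tokens attends to every other, the encoder can reason globally about the image, which is what lets the decoder emit each object exactly once.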
How well does the paper perform?
Performance on COCO
Discussion: how does DETR perform compared to other object detection methods, e.g., RetinaNet and Faster R-CNN?
Ablation studies
Discussion #1: what is the importance of each decoder layer?
Discussion #2: how important is the FFN inside the transformer?
- Removing it shrinks the network from 41.3M to 28.7M parameters
- but performance drops from 62.4 to 60.1 (mAP@0.5)
Generalization to unseen numbers of instances
Discussion: does DETR generalize to unseen numbers of instances?
- Some COCO classes are rarely (if ever) seen with many instances of the same class in a single image
- E.g., there is no image with more than 13 giraffes in the training set.
TL;DR
- Casts object detection as a direct set prediction problem
- Proposes a set-based global loss that forces unique predictions via bipartite matching
- Demonstrates that DETR, built on a transformer encoder-decoder architecture, achieves accuracy and run-time performance on par with existing methods