author: chengchunhsu
score: 7 / 10
What is the core idea?
Core idea: cast object detection as a direct set prediction problem
Motivation: avoid hand-designed components in modern object detectors
- E.g., NMS and anchor generation
Solution: DEtection TRansformer (DETR)
- a set-based global loss that forces unique predictions via bipartite matching
- a transformer encoder-decoder architecture
How is it realized (technically)?
Key component:
- a set prediction loss that forces unique matching between predicted and ground truth boxes
- an architecture that predicts (in a single pass) a set of objects and models their relation
Set prediction loss
DETR infers a fixed-size set of N predictions (N = 100 in the paper, well above the typical number of objects per image; unmatched slots predict a special "no object" class ∅)
Goal: score predicted objects (class, position, size) with respect to the ground truth
How:
- The loss first produces an optimal bipartite matching between predicted and ground-truth objects
- After that, we optimize the object-specific (class and bounding box) losses
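The matching step above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: real DETR uses an efficient Hungarian solver (e.g., `scipy.optimize.linear_sum_assignment`), and its matching cost also includes a GIoU term; here the cost is a simplified stand-in (negative class probability plus L1 box distance) and the search is brute force.

```python
from itertools import permutations

def match_cost(pred, gt):
    """Simplified pairwise matching cost: negative probability of the
    ground-truth class plus L1 distance between box coordinates.
    (A stand-in for the paper's L_match, which also uses a GIoU term.)"""
    prob, box = pred
    gt_class, gt_box = gt
    l1 = sum(abs(p - g) for p, g in zip(box, gt_box))
    return -prob[gt_class] + l1

def optimal_matching(preds, gts):
    """Return, for each ground-truth object, the index of the prediction
    it is matched to, minimizing the total cost. Brute force over
    permutations (fine for tiny N; DETR uses the O(N^3) Hungarian algorithm)."""
    best, best_perm = float("inf"), None
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p], gts[i]) for i, p in enumerate(perm))
        if cost < best:
            best, best_perm = cost, perm
    return list(best_perm)

# Two predictions: class probabilities over {cat, dog} and a (cx, cy, w, h) box
preds = [([0.9, 0.1], [0.5, 0.5, 0.2, 0.2]),   # confident "cat" near center
         ([0.2, 0.8], [0.1, 0.1, 0.1, 0.1])]   # confident "dog" in corner
gts = [(1, [0.1, 0.1, 0.1, 0.1]),              # dog in corner
       (0, [0.5, 0.5, 0.2, 0.2])]              # cat near center
print(optimal_matching(preds, gts))  # → [1, 0]: dog ↔ pred 1, cat ↔ pred 0
```

Because the matching is one-to-one, two predictions cannot both "claim" the same ground-truth box, which is what removes the need for NMS.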
Match loss: the pairwise cost used for matching,
L_match(y_i, ŷ_σ(i)) = −1[c_i ≠ ∅] · p̂_σ(i)(c_i) + 1[c_i ≠ ∅] · L_box(b_i, b̂_σ(i));
the optimal permutation σ̂ = argmin_σ Σ_i L_match(y_i, ŷ_σ(i)) is computed with the Hungarian algorithm
Hungarian loss: optimized after matching,
L_Hungarian(y, ŷ) = Σ_i [ −log p̂_σ̂(i)(c_i) + 1[c_i ≠ ∅] · L_box(b_i, b̂_σ̂(i)) ]
Bounding box prediction loss: a mix of generalized IoU (scale-invariant) and L1,
L_box(b_i, b̂_σ(i)) = λ_iou · L_iou(b_i, b̂_σ(i)) + λ_L1 · ‖b_i − b̂_σ(i)‖_1
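A minimal sketch of the box loss, combining an L1 term with a generalized-IoU term. Assumptions for illustration: boxes are axis-aligned corner coordinates (x1, y1, x2, y2) rather than the paper's (cx, cy, w, h) parametrization, and the weights mirror the paper's λ_iou = 2, λ_L1 = 5.

```python
def giou(a, b):
    """Generalized IoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # intersection rectangle (clamped to zero if the boxes do not overlap)
    inter = (max(0.0, min(ax2, bx2) - max(ax1, bx1))
             * max(0.0, min(ay2, by2) - max(ay1, by1)))
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # smallest box enclosing both; penalizes non-overlapping, far-apart boxes
    enclose = ((max(ax2, bx2) - min(ax1, bx1))
               * (max(ay2, by2) - min(ay1, by1)))
    return iou - (enclose - union) / enclose

def box_loss(pred, gt, l_iou=2.0, l_l1=5.0):
    """L_box = lambda_iou * (1 - GIoU) + lambda_L1 * ||pred - gt||_1."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return l_iou * (1.0 - giou(pred, gt)) + l_l1 * l1

print(box_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0)))  # perfect match → 0.0
```

The GIoU term matters because an L1 loss alone has different magnitudes for small and large boxes at the same relative overlap.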
Transformer architecture
- CNN backbone
  - Generates a low-resolution feature map (32× downsampling, 2048 channels)
- Encoder-decoder transformer
  - Encoder input: the flattened feature map plus positional encodings
  - Decoder input: N learned object queries; decoder output: N output embeddings
- Feed-forward network (FFN)
  - Independently decodes each output embedding into box coordinates and a class label
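The tensor shapes through this pipeline can be traced with simple arithmetic. The input resolution below is illustrative; hidden size d = 256 and N = 100 queries match the paper's defaults, and the class count uses COCO's 91 category ids plus a "no object" slot.

```python
# Trace tensor shapes through DETR (illustrative input resolution).
H, W = 800, 1066            # input image height and width
C, d, N = 2048, 256, 100    # backbone channels, hidden dim, object queries

# CNN backbone: 32x downsampling -> low-resolution feature map
h, w = H // 32, W // 32
print("backbone output:", (C, h, w))         # (2048, 25, 33)

# A 1x1 conv reduces channels to d, then spatial dims are flattened
# into a sequence; positional encodings are added with the same shape.
seq_len = h * w
print("encoder input:  ", (seq_len, d))      # (825, 256)

# The decoder attends N learned object queries to the encoder memory.
print("decoder output: ", (N, d))            # (100, 256)

# FFN heads decode each output embedding independently.
num_classes = 91                             # COCO category ids
print("predictions:    ", (N, 4), (N, num_classes + 1))  # boxes, class logits
```

Because every one of the 825 encoder tokens attends to every other, the encoder can reason globally about the image, which is what lets the decoder emit each object exactly once.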
How well does the paper perform?
Performance on COCO
Discussion: how does DETR perform compared to other object detection methods, e.g., RetinaNet and Faster R-CNN?
Ablation studies
Discussion #1: what is the importance of each decoder layer?
Discussion #2: how important is the FFN inside the transformer?
- Removing it shrinks the network from 41.3M to 28.7M parameters
- but performance drops from 62.4 to 60.1 (mAP@0.5)
Generalization to unseen numbers of instances
Discussion: does DETR generalize to unseen numbers of instances?
- Some COCO classes are rarely (if ever) seen with many instances of the same class in a single image
- E.g., there is no image with more than 13 giraffes in the training set.
TL;DR
- Casts object detection as a direct set prediction problem
- Proposes a set-based global loss that forces unique predictions via bipartite matching
- Demonstrates that DETR, built on a transformer encoder-decoder architecture, achieves accuracy and run-time performance on par with existing methods