Summary | cs395T

Deformable Convolutional Networks, Dai, Qi, Xiong, Li, Zhang, Hu, Wei; 2017 - Summary

author:	nilesh2797
score:	9 / 10

What is the core idea?

Standard CNNs have a fixed geometric structure: a convolution unit samples the input feature map at fixed locations, and a pooling layer reduces the spatial resolution at a fixed ratio
This paper proposes Deformable Convolutional Networks (DCN), which let convolution and RoI pooling units learn to deform their sampling grids

How is it realized (technically)?

Deformable Convolution

Composed of two convolution layers:

A regular convolution layer generates the output features, but its n x n sampling grid is augmented with offsets
The offsets are obtained by applying a separate convolutional layer over the same input feature map, and that layer is learned simultaneously with the main one
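
The sampling step described above can be sketched in NumPy. This is a minimal single-channel illustration, not the authors' implementation: the function names `bilinear_sample` and `deform_conv2d_single` are hypothetical, the offsets are passed in as an array rather than produced by the separate offset convolution, and bilinear interpolation with zero padding is assumed for fractional sampling positions (as I recall the paper doing). Each output location p0 computes the sum over kernel taps pn of w(pn) * x(p0 + pn + offset(p0, pn)).

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Bilinearly interpolate the 2-D array x at fractional (py, px).

    Positions outside the array contribute zero (zero padding).
    """
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                # Weight falls off linearly with distance to each neighbour.
                val += (1 - abs(py - yy)) * (1 - abs(px - xx)) * x[yy, xx]
    return val

def deform_conv2d_single(x, weight, offsets):
    """Single-channel deformable convolution sketch (stride 1, same size).

    x:       (H, W) input feature map
    weight:  (k, k) kernel, k odd
    offsets: (H, W, k*k, 2) array of (dy, dx) per output location and tap;
             in a real layer these would come from the offset convolution.
    """
    h, w = x.shape
    k = weight.shape[0]
    r = k // 2
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for n, (ky, kx) in enumerate(np.ndindex(k, k)):
                dy, dx = offsets[i, j, n]
                # Regular grid position plus the learned offset.
                acc += weight[ky, kx] * bilinear_sample(
                    x, i + ky - r + dy, j + kx - r + dx
                )
            out[i, j] = acc
    return out

# Usage: zero offsets reduce to an ordinary 3 x 3 convolution.
x = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))
zero_offsets = np.zeros((4, 4, 9, 2))
regular = deform_conv2d_single(x, kernel, zero_offsets)
```

With all offsets at zero the sketch matches a regular zero-padded convolution, which is why such a layer can start out behaving like a standard one and learn its deformation gradually.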

Deformable Region of Interest (RoI) Pooling

Same idea as deformable convolution, but the offsets are now added to the spatial binning positions

Regular RoI pooling first generates the pooled feature maps; a fully connected layer then generates normalized offsets from them
The normalized offsets are scaled by the RoI’s width and height, which makes offset learning invariant to RoI size
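
The scaling step can be illustrated with a short NumPy sketch. The function name `scale_offsets` and the (dx, dy) layout are hypothetical, and the extra scalar `gamma` (default 0.1) is an assumption based on my recollection of the paper, which is not stated in this summary. The point is only that the same normalized offset maps to a proportionally larger pixel shift on a larger RoI.

```python
import numpy as np

def scale_offsets(norm_offsets, roi_w, roi_h, gamma=0.1):
    """Convert normalized per-bin offsets into pixel offsets.

    norm_offsets: (k, k, 2) array of (dx, dy) per pooling bin, as would be
                  predicted by the fully connected layer.
    roi_w, roi_h: RoI width and height in feature-map pixels.
    gamma:        assumed scalar that damps the offset magnitude.
    """
    return gamma * norm_offsets * np.array([roi_w, roi_h], dtype=float)

# Usage: identical normalized offsets on two RoIs of different size.
norm = np.full((3, 3, 2), 0.5)
small = scale_offsets(norm, roi_w=10, roi_h=20)
large = scale_offsets(norm, roi_w=20, roi_h=40)
```

Here `large` is exactly twice `small`, so the fully connected layer only has to predict a relative shift and does not need to know the RoI's absolute size.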

How well does the paper perform?

Object Detection on COCO Dataset

Sample Visualization

TL;DR

Enables convolution units to learn where to sample the input feature map
End-to-end trainable without additional supervision
Yields significant improvements over standard CNNs on object detection