author: zhaoyue-zephyrus
score: 10 / 10
The main idea of this paper is to increase network depth while using only 3x3 convolution layers from beginning to end.
- Different from AlexNet, which uses 11x11 conv with stride 4, or ZF-Net, which uses 7x7 conv with stride 2, the network uses 3x3 conv with stride 1.
- A stack of three 3x3 conv layers has the same effective receptive field as a single 7x7 conv, but the non-linearity (ReLU) is applied three times instead of once, making the decision function more discriminative.
- Assuming \(C\) input and output channels, the number of parameters is reduced from \(7^2\times C^2\) to \(3(3^2\times C^2)\) (see the sketch after this list).
- Remove LRN (Local Response Normalization) because it increases memory consumption but does not improve accuracy.
- The downsampling is achieved by max-pooling with 2x2 window and stride 2.
- The number of channels of the 3x3 conv layers gradually increases from 64 to 128, 256, and finally 512, doubling after each max-pooling stage until it reaches 512.
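To make the parameter comparison concrete, here is a minimal PyTorch sketch (PyTorch and the channel count \(C=256\) are my choices for illustration, not the paper's setup) that counts the weights of one 7x7 conv versus a stack of three 3x3 convs:

```python
import torch.nn as nn

C = 256  # channel count; arbitrary choice for illustration

one_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)
three_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(inplace=True),
)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(one_7x7))    # 7^2 * C^2     = 3,211,264
print(n_params(three_3x3))  # 3 * 3^2 * C^2 = 1,769,472
```

With `padding` chosen as above, both versions preserve spatial size and cover the same 7x7 receptive field, so the comparison is apples-to-apples.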
The final VGG-Net comes in two main versions, VGG-16 and VGG-19, where 16 and 19 refer to the total number of conv and fc weight layers.
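For reference, a sketch of the VGG-16 ("configuration D") feature extractor, written in the torchvision `make_layers` style (the builder function and names are my own; the fc head of 4096 → 4096 → 1000 is omitted):

```python
import torch.nn as nn

# 'M' marks a 2x2, stride-2 max-pool; numbers are output channels of 3x3 convs.
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']  # 13 conv layers

def make_features(cfg, in_ch=3):
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)

features = make_features(VGG16_CFG)
# 13 conv layers here + 3 fc layers (4096 -> 4096 -> 1000) = 16 weight layers.
```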
The paper reveals the importance of weight initialization. For random initialization with a normal distribution, the authors first train a shallower version (configuration A: 8 conv layers + 3 fc layers) and then train a deeper one by initializing its first 4 conv layers and last 3 fc layers from the pre-trained shallow model. This is not needed if Glorot's initialization is used. This observation spurred research interest in improving network initialization, such as Kaiming's initialization in 2015.
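Both initialization schemes mentioned above are exposed in PyTorch (`nn.init.xavier_*` for Glorot, `nn.init.kaiming_*` for He/Kaiming); a minimal sketch on a toy network of my own choosing:

```python
import torch.nn as nn

def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)  # Glorot & Bengio (2010)
        # He/Kaiming (2015) alternative, derived for ReLU networks:
        # nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
net.apply(init_weights)  # recursively initializes every submodule
```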
The paper describes multi-scale training and testing and verifies their effectiveness via extensive experiments.
- For multi-scale training, each image is rescaled so that its shorter side equals \(S\), which is randomly sampled from \([S_{min}, S_{max}]\). This can be viewed as data augmentation via scale jittering (see the preprocessing sketch after this list).
- For multi-scale testing, each image is rescaled to each of a pre-defined set of test scales.
- Fully-convolutional (dense) style: the rescaled image is passed through the network with its fc layers converted to conv layers, and the resulting class score map is pooled by global averaging (see the conversion sketch after this list).
- Multi-crop style: alternatively, keep the network unchanged and take multiple fixed-size (224x224) crops of the rescaled image, as in training. The final score is averaged across all crops.
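A minimal sketch of the scale-jittered training preprocessing, assuming torchvision (the transform pipeline is my own composition; [256, 512] is the paper's \([S_{min}, S_{max}]\) range):

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as TF

S_MIN, S_MAX = 256, 512  # the paper's jittering range

train_tf = transforms.Compose([
    # Resize the shorter side to a random S in [S_MIN, S_MAX] ...
    transforms.Lambda(lambda img: TF.resize(img, random.randint(S_MIN, S_MAX))),
    # ... then take a random fixed-size crop, plus the usual flip.
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```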
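And a sketch of the fc-to-conv conversion behind dense evaluation: the first fc layer of VGG sees a 7x7x512 map, so its weights can be copied into an equivalent 7x7 conv (the later fc layers become 1x1 convs). The shapes below are illustrative, not taken from the paper's tables:

```python
import torch
import torch.nn as nn

fc1 = nn.Linear(512 * 7 * 7, 4096)
conv1 = nn.Conv2d(512, 4096, kernel_size=7)  # equivalent 7x7 conv
with torch.no_grad():
    conv1.weight.copy_(fc1.weight.view(4096, 512, 7, 7))
    conv1.bias.copy_(fc1.bias)

# On an uncropped, larger test image the converted network emits a spatial
# class score map; the final prediction is its spatial average:
score_map = torch.randn(1, 1000, 3, 3)   # e.g. what a ~288px input might yield
scores = score_map.mean(dim=(2, 3))      # (1, 1000) class scores
```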
In the appendix, the authors also show that VGG generalizes to other datasets. To do so, the last classification layer is removed; the 4096-d features from the penultimate fc layer are then extracted, aggregated across multiple locations and scales, l2-normalized, and fed into a linear SVM classifier. This achieves comparable or superior results on a variety of tasks: recognition, object detection, semantic segmentation, etc.
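A sketch of that transfer recipe, assuming torchvision and scikit-learn (the `weights` argument varies across torchvision versions, and `images`/`labels` are hypothetical downstream data; aggregation over locations and scales is omitted for brevity):

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16
from sklearn.svm import LinearSVC

model = vgg16(weights='IMAGENET1K_V1').eval()
model.classifier = model.classifier[:-1]   # drop the final 1000-way fc

@torch.no_grad()
def extract(batch):                        # batch: (N, 3, 224, 224) tensor
    feats = model(batch)                   # penultimate activations, (N, 4096)
    return F.normalize(feats, p=2, dim=1)  # l2-normalize

# X = extract(images).numpy()
# clf = LinearSVC().fit(X, labels)
```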
TL;DR
- A 16/19-layer very deep network composed purely of 3x3 conv layers.
- Detailed description of multi-scale training and testing, with extensive supporting experiments.
- Strong generalization to a wide range of tasks and datasets.