author: | philkr |
score: | 10 / 10 |
This is the paper that started the deep learning revolution in 2012. From a core technical level, the architecture looks quite similar to LeNet-5, just bigger.
There are a few core differences when scaling up the architecture (see the sketch after this list):
- The network uses 5 conv layers and 3 linear (fully-connected) layers
  - 2 conv and 1 FC more than LeNet-5
- It uses max-pooling
  - Pooling is overlapping, with an odd kernel size (\(3 \times 3\), stride \(2\))
- The first kernel is larger (\(11 \times 11\)) with stride \(4\)
  - Reduces computation, and allows for a higher input resolution
- All but the first convolution use padding
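For concreteness, here is a minimal PyTorch sketch of the architecture as described in the paper. It keeps everything on a single device (the cross-GPU split discussed below is omitted) and already includes the ReLU, normalization, and dropout choices covered in the next list; the `AlexNet` module name and variable names are my own.

```python
import torch
import torch.nn as nn


# Sketch of the single-GPU variant of AlexNet; the two-GPU split and the
# paper's exact weight initialization are omitted for brevity.
class AlexNet(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            # Conv 1: large 11x11 kernel, stride 4, no padding
            nn.Conv2d(3, 96, kernel_size=11, stride=4),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),  # overlapping pooling
            # Conv 2
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # Conv 3-5
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),  # dropout on the first two FC layers
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))


# Usage: a 227x227 input crop gives the expected 6x6x256 feature map
# before the fully-connected layers.
logits = AlexNet()(torch.randn(1, 3, 227, 227))
```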
By itself the AlexNet architecture likely would not have trained; it required a fairly large bag of tricks (accumulated over the preceding 12 years, 2000-2012) to work:
- ReLU activations
  - They do not saturate and pass the gradient back through the network
- Normalizations after the first two conv layers
  - This helped gradient back-propagation to the initial layers
  - Later papers showed that a good initialization makes this normalization obsolete
- This was one of the first papers to train on GPUs
  - The architecture is split down the middle, and each half resides on a different GPU
- Data augmentation: flip, crop, and color augmentations (see the sketch after this list)
- Dropout on the first two FC layers
  - Reduces overfitting
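A rough sketch of the flip/crop/color augmentations in the spirit of the paper, using torchvision transforms. Note the color perturbation here is `ColorJitter`, a substitution of mine; the paper uses a PCA-based lighting augmentation instead.

```python
from torchvision import transforms

# Training-time augmentation in the spirit of AlexNet: random crops of a
# rescaled image, horizontal flips, and a color perturbation.
# ColorJitter stands in for the paper's PCA-based lighting augmentation.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(227),  # the paper quotes 224, but 227 matches the stride-4 arithmetic above
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])

# At test time the paper averages predictions over multiple crops and their
# flips; a single center crop is the simplest variant.
test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(227),
    transforms.ToTensor(),
])
```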
Most of the architectural changes and training tricks were well known in the deep learning literature by 2012. Where the paper really shines is in its performance and experiments. The paper showed for the first time that deep networks can easily outperform classical methods and hand-designed features in computer vision tasks. The ImageNet result shows a huge improvement over the prior SOTA (roughly a 10 percentage-point reduction in top-5 error). Within a year, almost all of computer vision switched to AlexNet-like architectures.
TL;DR
- AlexNet: Larger LeNet-5 architecture
- Bag of many network training tricks
- Stellar performance on ImageNet