author: | liyanc |
score: | 5 / 10 |
-
What is the core idea?
The author proposed a method for adaptively controlling the learning rate in order to eliminate manual tuning and significantly improve convergence speed. Common methods require manual tuning of the learning rate, and this empirical tuning has a significant impact on both convergence speed and final performance. Additionally, the popular exponential decay of the learning rate requires longer training time and introduces yet another empirically tuned parameter.
Therefore, this paper proposes an adaptive search and decay of learning rates that waives empirical trials. The method builds on the observation that larger learning rates improve short-term results while smaller learning rates benefit long-term results (specifically model generalizability, although the author does not mention this). A cyclic, sweeping learning rate can therefore take advantage of both worlds.
-
How is it realized (technically)?
Essentially, the author proposes to sweep the learning rate back and forth within a range, as shown in the figure.
As for the shape of the swing (triangular, parabolic, sinusoidal), the author's empirical results suggest it is irrelevant.
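For concreteness, a minimal sketch of the triangular variant of this sweep is given below. The names `base_lr`, `max_lr`, and `step_size` (the half-cycle length in iterations) are illustrative, not taken from the paper's code.

```python
import numpy as np

def triangular_clr(iteration, step_size, base_lr, max_lr):
    """Triangular cyclical learning rate (sketch).

    The learning rate rises linearly from `base_lr` to `max_lr` over
    `step_size` iterations, then falls back linearly, and repeats.
    """
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```

For example, with `step_size=2000`, `base_lr=1e-3`, `max_lr=6e-3`, the schedule starts at 1e-3, peaks at 6e-3 at iteration 2000, and returns to 1e-3 at iteration 4000.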
There are two key considerations for this method: how to choose the step size and how to determine an appropriate range for the swing.
- The optimal step size should cover 2-10 epochs.
- Perform a linear scan of learning rates during training and locate a reasonable range from the scan results, as shown in the example on CIFAR-10 (see the sketch after this list).
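A rough sketch of that linear scan (often called a learning-rate range test) is shown below. It assumes a hypothetical single-batch routine `train_step(batch, lr)` that returns a running accuracy; the real training loop would supply its own equivalent.

```python
def lr_range_test(train_step, data_iter, min_lr=1e-5, max_lr=1.0, num_iters=1000):
    """Scan learning rates linearly and record accuracy at each step.

    `train_step(batch, lr)` is a placeholder for one training step that
    returns the current accuracy. Choose the lower bound of the cycle where
    accuracy starts to rise, and the upper bound where it stalls or drops.
    """
    history = []
    for i, batch in zip(range(num_iters), data_iter):
        lr = min_lr + (max_lr - min_lr) * i / max(1, num_iters - 1)
        history.append((lr, train_step(batch, lr)))
    return history
```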
-
How well does the paper perform?
The author first validates and analyzes the method on the CIFAR-10 dataset. The test accuracy for each training protocol is shown in the figure, which demonstrates that the proposed method achieves optimal performance with a significantly shorter training time. To rule out the hypothesis that only the smaller end of the learning-rate range contributes to the performance, the author designed a protocol that decays the learning rate within the first step size and then fixes it until the end of training. The result is inferior, which shows that neither large nor small learning rates alone achieve the performance; it is the cyclic sweeping that produces the result.
The author also performed experiments on the CIFAR-100 dataset with ResNet, Stochastic Depth, and DenseNet. The proposed method demonstrates a similar edge over the other baselines. The table shows the results of baselines starting at three different learning rates alongside the proposed sweeping learning rate.
Finally, the author trained and evaluated on ImageNet with AlexNet and GoogLeNet. All results show that the proposed method leads the baselines in both final performance and convergence speed.
Note that the experiments were all performed on relatively small datasets with a simple classification task, so the conclusions might not generalize well to today's work. According to more recent works, the proposed method could potentially improve the generalizability of trained models, since sweeping learning rates likely leads to "flatter" minima, which would imply better generalization. It would be interesting to see follow-up works with further analysis in this respect.
TL;DR
- Sweeping learning rates within a range can take advantage of both large learning rates and small learning rates.
- Step size should cover 2-10 epochs, and the range should be determined by a pre-scan over possible learning rates.
- Cyclic learning rates show a significant improvement on relatively small datasets for image classification.