Lookahead Optimizer: k steps forward, 1 step back

About

The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of fast weights generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.

Michael R. Zhang, James Lucas, Geoffrey Hinton, Jimmy Ba• 2019

Related benchmarks

Task	Dataset	Result
Image Classification	CIFAR-100 (val)	Accuracy78.34	781
Natural Language Inference	RTE	Accuracy59.9	590
Language Modeling	Penn Treebank (test)	Perplexity57.72	420
Long-term forecasting	ETTh1	MSE0.465	409
Image Classification	SVHN	--	395
Image Classification	CIFAR-10 (val)	Top-1 Accuracy95.27	377
Image Classification	ImageNet (val)	Top-1 Accuracy75.49	354
Long-term forecasting	ETTm2	MSE0.284	350
Long-term forecasting	ETTh2	MSE0.386	310
Sentiment Analysis	IMDB (test)	Accuracy91.1	306

Showing 10 of 27 rows

Other info

Code

Follow for update

@wizwand_team Discord