Prodigy: An Expeditiously Adaptive Parameter-Free Learner

About

We consider the problem of estimating the learning rate in adaptive methods, such as AdaGrad and Adam. We propose Prodigy, an algorithm that provably estimates the distance to the solution $D$, which is needed to set the learning rate optimally. At its core, Prodigy is a modification of the D-Adaptation method for learning-rate-free learning. It improves upon the convergence rate of D-Adaptation by a factor of $O(\sqrt{\log(D/d_0)})$, where $d_0$ is the initial estimate of $D$. We test Prodigy on 12 common logistic-regression benchmark datasets, VGG11 and ResNet-50 training on CIFAR10, ViT training on ImageNet, LSTM training on IWSLT14, DLRM training on the Criteo dataset, VarNet on the Knee MRI dataset, as well as RoBERTa and GPT transformer training on BookWiki. Our experimental results show that our approach consistently outperforms D-Adaptation and reaches test accuracy values close to those of hand-tuned Adam.
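
To make the idea of estimating $D$ on the fly more concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' reference implementation: the step-size and estimate formulas follow one reading of a gradient-descent variant of the method, and the function name `prodigy_gd`, the toy objective $f(x) = |x - 10|$, and all parameter defaults are hypothetical choices made for this example. The estimate `d` is a running lower bound on $D = \|x_0 - x_*\|$, which for convex objectives follows from convexity and the Cauchy-Schwarz inequality.

```python
import math

def prodigy_gd(subgrad, x0, G, d0=1e-6, steps=1000):
    """Subgradient descent with a running lower-bound estimate d of D = ||x0 - x*||.

    subgrad: returns a subgradient (list of floats) at a point.
    G: assumed upper bound on subgradient norms; must be > 0.
    Returns the step-size-weighted average iterate and the final estimate d.
    """
    x = list(x0)
    d = d0
    weighted_sq = 0.0   # running sum of d_i^2 * ||g_i||^2
    numerator = 0.0     # running sum of eta_i * <g_i, x0 - x_i>
    eta_sum = 0.0
    x_avg = [0.0] * len(x0)
    for _ in range(steps):
        g = subgrad(x)
        weighted_sq += d * d * sum(gi * gi for gi in g)
        # AdaGrad-norm-like step size, scaled by the current estimate d
        eta = d * d / math.sqrt(d * d * G * G + weighted_sq)
        numerator += eta * sum(gi * (x0i - xi) for gi, x0i, xi in zip(g, x0, x))
        eta_sum += eta
        x_avg = [a + eta * xi for a, xi in zip(x_avg, x)]
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        dist = math.sqrt(sum((xi - x0i) ** 2 for xi, x0i in zip(x, x0)))
        if dist > 0:
            # the estimate never decreases
            d = max(d, numerator / dist)
    return [a / eta_sum for a in x_avg], d

# Toy problem (hypothetical): minimize f(x) = |x - 10| starting from x0 = 0, so D = 10.
target = [10.0]

def abs_subgrad(x):
    # sign of (x - target), with 0 at the minimizer
    return [float((xi > ti) - (xi < ti)) for xi, ti in zip(x, target)]

x_hat, d_est = prodigy_gd(abs_subgrad, [0.0], G=1.0)
```

In this sketch the estimate starts at the deliberately tiny value `d0` and grows as evidence accumulates, so no learning rate has to be supplied by hand. The only problem-dependent input is the gradient bound `G`.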

Konstantin Mishchenko, Aaron Defazio • 2023

Related benchmarks

Task	Dataset	Result
Image Classification	CIFAR-10 (test)	Accuracy: 89.3	3381
Image Classification	TinyImageNet (val)	Accuracy: 77.944	289
Image Classification	Food-101 (test)	Accuracy: 72	145
Image Classification	ImageNet-100 (test)	Clean Accuracy: 78	123
Language Modeling	C4 LLaMA-130M (val)	Perplexity: 18.727	40
Language Modeling Pre-training	C4 (val)	--	14
Image Classification	CIFAR-10	Latency (ms/iter): 29.14	13
Image Classification	MNIST (test)	Accuracy: 99.55	12
Language Modeling	LLaMA-350M pre-training (val)	Validation Loss: 2.715	10
MRI Reconstruction	fastMRI	--	7

