
Don't Use Large Mini-Batches, Use Local SGD

About

Mini-batch stochastic gradient methods (SGD) are the state of the art for distributed training of deep neural networks. Drastic increases in mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock: models trained with large batches often do not generalize well, i.e., they do not achieve good accuracy on new data. As a remedy, we propose post-local SGD and show that it significantly improves generalization performance compared to large-batch training on standard benchmarks, while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of local SGD variants.
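
For intuition, here is a minimal sketch of local SGD and the post-local SGD schedule the abstract describes: workers take H local SGD steps between model-averaging rounds, and post-local SGD runs synchronously (H = 1, equivalent to large-batch mini-batch SGD) for an initial phase before switching to H > 1. All names (n_workers, local_steps_fn, switch_step), the toy least-squares objective, and the hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_per_worker, n_workers = 10, 256, 4
w_true = rng.normal(size=d)

# Each worker holds its own data shard (assumed IID split for this sketch).
shards = []
for _ in range(n_workers):
    X = rng.normal(size=(n_per_worker, d))
    y = X @ w_true + 0.1 * rng.normal(size=n_per_worker)
    shards.append((X, y))

def sgd_step(w, X, y, lr, batch=32):
    """One mini-batch SGD step on the least-squares loss."""
    idx = rng.choice(len(y), size=batch, replace=False)
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
    return w - lr * grad

def train(total_rounds, local_steps_fn, lr=0.05):
    """Local SGD: each worker takes H local steps, then models are averaged."""
    w = np.zeros(d)
    for t in range(total_rounds):
        H = local_steps_fn(t)
        local_models = []
        for X, y in shards:
            w_k = w.copy()
            for _ in range(H):          # H local steps, no communication
                w_k = sgd_step(w_k, X, y, lr)
            local_models.append(w_k)
        w = np.mean(local_models, axis=0)  # synchronize: average the models
    return w

# Plain local SGD: always H > 1 local steps between averaging rounds.
w_local = train(total_rounds=100, local_steps_fn=lambda t: 8)

# Post-local SGD: behave like synchronous mini-batch SGD (H = 1) for the
# first `switch_step` rounds, then switch to local SGD for the rest.
switch_step = 50
w_post = train(total_rounds=100,
               local_steps_fn=lambda t: 1 if t < switch_step else 8)

for name, w in [("local SGD", w_local), ("post-local SGD", w_post)]:
    print(f"{name}: distance to w_true = {np.linalg.norm(w - w_true):.4f}")
```

Note the design point the paper exploits: with H = 1, averaging models after each step (from a shared starting point) is equivalent to averaging gradients, i.e., synchronous large-batch SGD; increasing H trades communication for more local computation.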

Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi • 2018

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy: 74.03 | 450 |
| Image Classification | Caltech101 (test) | -- | 159 |
| Image Classification | Caltech-256 (test) | Top-1 Accuracy: 81.14 | 74 |
| Image Classification | CIFAR100 (test) | -- | 43 |
