Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

About

We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method that unifies and generalizes recent averaging-based optimizers such as single-worker DiLoCo and Schedule-Free in a non-distributed setting. While DiLoCo relies on a memory-intensive two-loop structure to periodically aggregate pseudo-gradients using Nesterov momentum, GPA eliminates this complexity by decoupling Nesterov's interpolation constants, enabling smooth iterate averaging at every step. Structurally, GPA resembles Schedule-Free but replaces uniform averaging with an exponential moving average. Empirically, GPA consistently outperforms single-worker DiLoCo and AdamW while incurring lower memory overhead than DiLoCo. GPA reaches the target validation loss 8.71%, 10.13%, and 9.58% faster (in steps) than the AdamW baseline for Llama-160M, 1B, and 8B models, respectively. Similarly, on the ImageNet ViT workload, GPA achieves speedups of 7% and 25.5% in the small- and large-batch settings, respectively. Furthermore, we prove that for any base optimizer with $O(\sqrt{T})$ regret, where $T$ is the number of iterations, GPA matches or exceeds the original convergence guarantees, depending on the interpolation constants.
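To make the structure concrete, here is a minimal sketch of a Schedule-Free-style three-sequence update in which the uniform average is swapped for an exponential moving average, as the abstract describes. The function name `gpa_step`, the constants, and the use of plain SGD as the base optimizer are illustrative assumptions, not the paper's exact algorithm.

```python
import torch

def gpa_step(z, x, grad_fn, lr=0.1, beta=0.9, c=0.05):
    """One GPA-style step (illustrative sketch, not the paper's exact update).

    z       -- fast iterate updated by the base optimizer (plain SGD here)
    x       -- slowly moving averaged iterate used for evaluation
    grad_fn -- returns the gradient at a given point
    beta    -- interpolation constant for the gradient-evaluation point
    c       -- EMA coefficient; Schedule-Free instead uses c_t = 1/t (uniform)
    """
    y = (1 - beta) * z + beta * x   # evaluate the gradient at an interpolated point
    g = grad_fn(y)
    z = z - lr * g                  # base-optimizer update on the fast iterate
    x = (1 - c) * x + c * z         # exponential moving average replaces uniform averaging
    return z, x

# Toy usage: minimize f(w) = ||w||^2 / 2, whose gradient is w.
z = x = torch.ones(3)
for _ in range(200):
    z, x = gpa_step(z, x, grad_fn=lambda y: y)
print(x)  # x approaches the minimizer at the origin
```

With a constant EMA coefficient, the averaged iterate keeps adapting at a fixed rate rather than freezing as $1/t$ weights shrink, which is one plausible reading of how GPA smooths DiLoCo's periodic outer-loop aggregation into a per-step average.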

Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, Lin Xiao • 2025

Related benchmarks

Task                         Dataset                       Result                   Rank
Language Model Pre-training  C4 Llama-160M scratch (val)   Validation Loss 3.0908   20
Language Model Pre-training  C4 Llama-1B scratch (val)     Validation Loss 2.645    16
