
AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism

About

Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their scalability. We address this communication bottleneck by introducing asynchronous updates across both parallelism axes, relaxing the co-location requirement at the expense of introducing staleness between pipeline stages and data parallel replicas. To mitigate staleness, for pipeline parallelism, we adopt a weight look-ahead approach, and for data parallelism, we introduce an asynchronous sparse averaging method equipped with an exponential moving average based correction mechanism. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models (up to 1B parameters) demonstrate that our approach matches the performance of the fully synchronous baseline, while significantly reducing communication overhead.
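The data-parallel idea above can be sketched in code. This is a minimal illustrative sketch, not the paper's implementation: the function name, the 50/50 mixing weight, and the uniform random sparsity mask are all assumptions introduced here for clarity. It shows one round of asynchronous sparse averaging in which a worker mixes a random subset of its parameters with a possibly stale peer copy, pulling toward an exponential moving average (EMA) of its own parameters to damp staleness.

```python
import numpy as np

def ema_corrected_sparse_average(local, peer, ema, frac=0.1, beta=0.9, rng=None):
    """One illustrative round of asynchronous sparse averaging.

    local: this replica's current parameters (flat array)
    peer:  a possibly stale parameter copy received from another replica
    ema:   running EMA of the local parameters (the staleness correction)
    frac:  fraction of coordinates communicated per round (sparsity)
    beta:  EMA decay factor
    """
    rng = np.random.default_rng() if rng is None else rng
    # Update the EMA tracker of the local parameters.
    ema = beta * ema + (1.0 - beta) * local
    # Choose a sparse random subset of coordinates to average this round.
    mask = rng.random(local.shape) < frac
    new = local.copy()
    # On the selected coordinates, average the peer value with the EMA
    # rather than the raw local value, damping the effect of staleness.
    new[mask] = 0.5 * (ema[mask] + peer[mask])
    return new, ema
```

Only `frac` of the coordinates move per round, so each exchange transmits a small slice of the model rather than a full replica, which is the source of the communication savings.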

Thalaiyasingam Ajanthan, Sameera Ramasinghe, Gil Avraham, Hadi Mohaghegh Dolatabadi, Chamin P Hewa Koneputugodage, Violetta Shevchenko, Yan Zuo, Alexander Long • 2026

Related benchmarks

Task | Dataset | Result (Perplexity) | Rank
Language Modeling | FineWeb (val) | – | 156
Language Modeling | OpenWebText (val) | – | 70
Language Modeling | WikiText (val) | 21.14 | 34
Language Modeling | BookCorpus (val) | 35.09 | 5
