
AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism

About

Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their scalability. We address this communication bottleneck by introducing asynchronous updates across both parallelism axes, relaxing the co-location requirement at the expense of introducing staleness between pipeline stages and data parallel replicas. To mitigate staleness, for pipeline parallelism, we adopt a weight look-ahead approach, and for data parallelism, we introduce an asynchronous sparse averaging method equipped with an exponential moving average based correction mechanism. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models (up to 1B parameters) demonstrate that our approach matches the performance of the fully synchronous baseline, while significantly reducing communication overhead.
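The data-parallel idea above can be sketched in code. This is a minimal illustrative sketch, not the paper's implementation: the function name, the 50/50 mixing weight, and the uniform random sparsity mask are all assumptions introduced here for clarity. It shows one round of asynchronous sparse averaging in which a worker mixes a random subset of its parameters with a possibly stale peer copy, pulling toward an exponential moving average (EMA) of its own parameters to damp staleness.

```python
import numpy as np

def ema_corrected_sparse_average(local, peer, ema, frac=0.1, beta=0.9, rng=None):
    """One illustrative round of asynchronous sparse averaging.

    local: this replica's current parameters (flat array)
    peer:  a possibly stale parameter copy received from another replica
    ema:   running EMA of the local parameters (the staleness correction)
    frac:  fraction of coordinates communicated per round (sparsity)
    beta:  EMA decay factor
    """
    rng = np.random.default_rng() if rng is None else rng
    # Update the EMA tracker of the local parameters.
    ema = beta * ema + (1.0 - beta) * local
    # Choose a sparse random subset of coordinates to average this round.
    mask = rng.random(local.shape) < frac
    new = local.copy()
    # On the selected coordinates, average the peer value with the EMA
    # rather than the raw local value, damping the effect of staleness.
    new[mask] = 0.5 * (ema[mask] + peer[mask])
    return new, ema
```

Only `frac` of the coordinates move per round, so each exchange transmits a small slice of the model rather than a full replica, which is the source of the communication savings.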

Thalaiyasingam Ajanthan, Sameera Ramasinghe, Gil Avraham, Hadi Mohaghegh Dolatabadi, Chamin P Hewa Koneputugodage, Violetta Shevchenko, Yan Zuo, Alexander Long • 2026

Related benchmarks

Task | Dataset | Result (Perplexity) | Rank
Language Modeling | FineWeb (val) | – | 156
Language Modeling | OpenWebText (val) | – | 70
Language Modeling | WikiText (val) | 21.14 | 34
Language Modeling | BookCorpus (val) | 35.09 | 5
