FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models

About

Blocking communication is a major hurdle to running MoEs efficiently in distributed settings. To address this, we present FarSkip-Collective, which modifies the architecture of modern models to enable overlapping of their computation with communication. Our approach modifies the architecture to skip connections in the model, and it is unclear a priori whether the modified architecture can remain as capable, especially for large state-of-the-art models and when all of the model's layers are modified. We answer this question in the affirmative and fully convert a series of state-of-the-art models ranging from 16B to 109B parameters to enable overlapping of their communication while achieving accuracy comparable to their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve accuracy within 1% of its instruction-tuned release, averaged over a wide range of downstream evaluations. In addition to demonstrating retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks. For inference, we demonstrate a 32.6% speedup in Time To First Token when serving a converted DeepSeek-V3 architecture with expert parallelism in SGLang and achieve 97.3% communication-computation overlap during the prefill stage. During training, our approach enables 88.9% overlap of the all-to-all communication collectives when pre-training DeepSeek-V3 MoE layers with expert parallelism.
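To make the notion of communication-computation overlap concrete, the following is a toy timing model, not the paper's measurement method. It assumes a layer's wall-clock time is its compute time plus whatever communication is left exposed, that is, not hidden behind compute. The function names and the 10 ms / 5 ms timings are hypothetical; only the 97.3% overlap figure comes from the abstract.

```python
def step_time(compute_ms: float, comm_ms: float, overlap: float) -> float:
    """Wall-clock time of one layer under a simple additive model.

    `overlap` is the fraction of communication time hidden behind compute:
    0.0 means fully blocking communication, 1.0 means fully hidden.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be in [0, 1]")
    exposed_comm_ms = comm_ms * (1.0 - overlap)  # communication still on the critical path
    return compute_ms + exposed_comm_ms


def speedup(compute_ms: float, comm_ms: float, overlap: float) -> float:
    """Ratio of blocking step time to overlapped step time."""
    blocking = step_time(compute_ms, comm_ms, 0.0)
    overlapped = step_time(compute_ms, comm_ms, overlap)
    return blocking / overlapped


# Hypothetical timings (assumption): 10 ms compute and 5 ms communication per layer.
# The 0.973 overlap fraction is the prefill figure quoted in the abstract.
print(step_time(10.0, 5.0, 0.0))
print(step_time(10.0, 5.0, 0.973))
print(speedup(10.0, 5.0, 0.973))
```

Under this model the attainable speedup depends on how large communication is relative to compute, so the same overlap fraction can yield different end-to-end gains on different hardware or parallelism settings.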

Yonatan Dukler, Guihong Li, Deval Shah, Jiang Liu, Vikram Appia, Emad Barsoum • 2025

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	WinoGrande	--	1581
Science Question Answering	ARC-E	Accuracy 87.3	240
Mathematical Reasoning	GSM-8K	Accuracy 89.8	107
Commonsense Question Answering	CommonsenseQA	Accuracy 84.9	92
Science Question Answering	OpenBookQA	Accuracy 45.2	89
Multi-task Language Understanding	MMLU	MMLU Score 80.2	86
Commonsense Reasoning	HellaSwag	HellaSwag Score 82.9	53
Science Question Answering	ARC-C	ARC-C Score 64.6	43
Physical Commonsense Reasoning	PIQA	PIQA Score 81.1	16
Code Generation	HumanEval+	HEVAL+ Score 73.8	9
