
FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models

About

Blocking communication is a major hurdle to running Mixture of Experts (MoE) models efficiently in distributed settings. To address this, we present FarSkip-Collective, which modifies the architecture of modern models so that their computation can overlap with communication. Our approach modifies the architecture to skip connections in the model, and it is unclear a priori whether the modified architecture can remain as capable, especially for large state-of-the-art models and when all of the model layers are modified. We answer this question in the affirmative: we fully convert a series of state-of-the-art models ranging from 16B to 109B parameters to enable overlapping of their communication while achieving accuracy comparable to that of their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve average accuracy within 1% of its instruction-tuned release across a wide range of downstream evaluations. Beyond demonstrating that the large modified models retain accuracy, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks. For inference, we demonstrate a 32.6% speedup in Time To First Token when serving a converted DeepSeek-V3 architecture with expert parallelism in SGLang, and we achieve 97.3% communication-computation overlap during the prefill stage. For training, our approach enables 88.9% overlap of the all-to-all communication collectives when pre-training DeepSeek-V3 MoE layers with expert parallelism.
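To make the overlap figures above concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes, as an interpretation not stated in the abstract, that an overlap of 97.3% means 97.3% of the communication time is hidden behind computation, so only the remainder stays on the critical path. The function names and the 60 ms / 40 ms timings are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def exposed_comm_ms(comm_ms: float, overlap: float) -> float:
    """Communication time left on the critical path.

    Assumption: `overlap` is the fraction of communication time that is
    hidden behind computation (0.0 = fully blocking, 1.0 = fully hidden).
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be in [0, 1]")
    return comm_ms * (1.0 - overlap)


def step_time_ms(compute_ms: float, comm_ms: float, overlap: float) -> float:
    """Simple additive model: compute time plus the exposed communication."""
    return compute_ms + exposed_comm_ms(comm_ms, overlap)


if __name__ == "__main__":
    # Hypothetical per-step timings, not taken from the paper.
    compute_ms, comm_ms = 60.0, 40.0
    for overlap in (0.0, 0.889, 0.973):
        total = step_time_ms(compute_ms, comm_ms, overlap)
        print(f"overlap={overlap:.3f}  step time={total:.2f} ms")
```

Under this simple model, the gain from overlap grows with the share of step time spent in communication, which is why blocking all-to-all collectives in expert-parallel MoE layers are a natural target.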

Yonatan Dukler, Guihong Li, Deval Shah, Jiang Liu, Vikram Appia, Emad Barsoum • 2025

Related benchmarks

Task                               Dataset         Result                 Rank
Commonsense Reasoning              WinoGrande      --                     1442
Science Question Answering         ARC-E           Accuracy 87.3          240
Mathematical Reasoning             GSM-8K          Accuracy 89.8          107
Commonsense Question Answering     CommonsenseQA   Accuracy 84.9          92
Multi-task Language Understanding  MMLU            MMLU Score 80.2        86
Science Question Answering         OpenBookQA      Accuracy 45.2          82
Commonsense Reasoning              HellaSwag       HellaSwag Score 82.9   53
Science Question Answering         ARC-C           ARC-C Score 64.6       43
Physical Commonsense Reasoning     PIQA            PIQA Score 81.1        16
Code Generation                    HumanEval+      HEVAL+ Score 73.8      9