DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts
About
Mixture-of-Experts (MoE) models have become a leading approach for decoupling parameter count from computational cost in large language models, yet effectively scaling MoE performance remains a challenge. Prior work shows that fine-grained experts enlarge the space of expert combinations and improve flexibility, but they also impose substantial routing overhead, creating a new scalability bottleneck. In this paper, we explore a complementary axis for scaling: how expert outputs are aggregated. We theoretically show that replacing the standard weighted-summation aggregation with structural aggregation expands the expert-combination space without altering the experts or the router, and can enable multi-step reasoning within a single MoE layer. To this end, we propose DAG-MoE, a sparse MoE framework that employs a lightweight module to automatically learn the aggregation structure among the selected experts. Extensive experiments under standard language modeling settings show that DAG-MoE consistently improves performance in both pretraining and fine-tuning, surpassing traditional MoE baselines.
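
To make the contrast between weighted-summation and structural aggregation concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify how DAG-MoE builds or learns its structure, so the function `dag_aggregate`, the `edges` dictionary, the additive way an earlier expert's output is mixed into a later expert's input, and the toy experts are all hypothetical. Only the standard top-k weighted sum, y = sum_i g_i * E_i(x), reflects common MoE practice.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their gates."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    idx = order[:k]
    gates = softmax([router_logits[i] for i in idx])
    return idx, gates

def weighted_sum_aggregate(x, experts, idx, gates):
    """Standard MoE aggregation: every selected expert sees the same x."""
    y = [0.0] * len(x)
    for i, g in zip(idx, gates):
        out = experts[i](x)
        y = [a + g * b for a, b in zip(y, out)]
    return y

def dag_aggregate(x, experts, idx, gates, edges):
    """Hypothetical structural aggregation (an assumption, not the paper's).

    The selected experts are visited in routing order. edges[(p, c)] is the
    weight with which the output of the expert at position p is added to
    the input of the expert at position c, with p < c, so the graph is
    acyclic by construction. With no edges this reduces to the weighted sum.
    """
    outputs = []
    for c, i in enumerate(idx):
        inp = list(x)
        for p in range(c):
            w = edges.get((p, c), 0.0)
            if w != 0.0:
                inp = [a + w * b for a, b in zip(inp, outputs[p])]
        outputs.append(experts[i](inp))
    y = [0.0] * len(x)
    for g, out in zip(gates, outputs):
        y = [a + g * b for a, b in zip(y, out)]
    return y

# Toy experts standing in for feed-forward networks (hypothetical).
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [a * a for a in v],
    lambda v: [-a for a in v],
]

x = [1.0, 2.0]
idx, gates = top_k_route([0.1, 2.0, 1.0, -1.0], k=2)
flat = weighted_sum_aggregate(x, experts, idx, gates)
no_edges = dag_aggregate(x, experts, idx, gates, edges={})
chained = dag_aggregate(x, experts, idx, gates, edges={(0, 1): 1.0})
```

In this sketch the router and the experts are identical in both calls; only the aggregation changes. With an empty `edges` dictionary the result matches the plain weighted sum, while a single edge lets the second selected expert operate on the first expert's output, which is one way a single layer could compose experts sequentially.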
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Language Modeling | C4 (test) | Perplexity | 34.21 | 464 |
| Language Modeling | FineWeb-Edu (test) | Perplexity (Test) | 24.69 | 58 |
| Language Modeling | The Pile (test) | PPL (The Pile Test) | 10.27 | 53 |
| Language Modeling | Wiki (test) | Perplexity | 20.54 | 2 |
| Language Understanding and Reasoning | Downstream Task Suite (PIQA, ARC-e, HellaSwag, GPQA, Lambada, MMLU, BBH) | PIQA | 50.67 | 2 |