MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE
About
The generation quality of large language models (LLMs) is often improved with inference-time sequence-level scaling methods (e.g., Chain-of-Thought). We introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes multiple output proposals for a single token from the same model and aggregates them. We implement this concept for Mixture-of-Experts (MoE) models and call the resulting method Roster of Experts (RoE). RoE is a training-free inference algorithm that turns a single MoE into a dynamic ensemble of MoEs. It injects controlled stochasticity into the expert routing mechanism, so that each token is processed by several different sampled sets of experts, and it aggregates their outputs into a more accurate final prediction. To offset the added computational cost, we introduce an efficient batching strategy and a specialized KV-caching mechanism that minimize compute and memory overhead. For example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less compute for inference. These gains are achieved without any fine-tuning of model parameters.
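The routing-and-aggregation idea described above can be illustrated with a toy, single-layer sketch in plain Python. Everything here is an assumption made for illustration and is not taken from the paper: the abstract does not say how the stochasticity is injected or how proposals are combined, so this sketch perturbs router logits with Gumbel noise scaled by a hypothetical temperature `tau`, selects the top-`k` experts, and averages the resulting token distributions. The names `noisy_top_k`, `moe_forward`, and `roe_predict` are hypothetical, and the batching and KV-caching optimizations are not modeled.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def noisy_top_k(router_logits, k, tau, rng):
    """Select k experts after adding Gumbel noise scaled by tau (assumed noise model)."""
    noisy = []
    for logit in router_logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        gumbel = -math.log(-math.log(u))
        noisy.append(logit + tau * gumbel)
    order = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)
    return sorted(order[:k])


def moe_forward(x, router_logits, experts, chosen):
    """Mix the chosen experts' outputs, weighted by a softmax over their clean logits."""
    weights = softmax([router_logits[i] for i in chosen])
    outputs = [experts[i](x) for i in chosen]
    dim = len(outputs[0])
    return [sum(w * o[d] for w, o in zip(weights, outputs)) for d in range(dim)]


def roe_predict(x, router_logits, experts, k, tau, n_samples, seed=0):
    """Average the token distributions from n_samples stochastic routings."""
    rng = random.Random(seed)
    total = None
    for _ in range(n_samples):
        chosen = noisy_top_k(router_logits, k, tau, rng)
        probs = softmax(moe_forward(x, router_logits, experts, chosen))
        total = probs if total is None else [t + p for t, p in zip(total, probs)]
    return [t / n_samples for t in total]


# Toy setup: four "experts" that each map a scalar input to 3 vocabulary logits.
experts = [
    lambda x: [1.0 * x, 0.0, 0.0],
    lambda x: [0.0, 1.0 * x, 0.0],
    lambda x: [0.0, 0.0, 1.0 * x],
    lambda x: [0.5 * x, 0.5 * x, 0.0],
]
router_logits = [2.0, 1.0, 0.5, 0.0]

aggregated = roe_predict(1.0, router_logits, experts, k=2, tau=1.0, n_samples=8)
print(aggregated)
```

With `tau=0.0` the noise vanishes and every sample picks the same top-`k` experts, so the sketch reduces to ordinary deterministic routing. Larger values of `tau` trade routing fidelity for diversity among the sampled expert sets.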
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 81.2 | 351 |
| Mathematical Reasoning | AIME24 | Accuracy | 92 | 130 |
| Mathematical Reasoning | HMMT25 | Accuracy | 78 | 78 |
| Mathematical Reasoning | MATH500 | Accuracy | 38.2 | 57 |
| Hard LLM Reasoning | HLE | Accuracy | 7 | 10 |
| Mathematical Reasoning | AIME 25 | Accuracy | 86.7 | 10 |
| Science Question Answering | ARC Challenge | Accuracy | 65.2 | 10 |
| Science Question Answering | ARC Easy | Accuracy | 84.3 | 10 |