FineRMoE: Dimension Expansion for Finer-Grained Expert with Its Upcycling Approach
About
As the scaling law of fine-grained MoE reveals, model performance ceases to improve once the granularity of the intermediate dimension exceeds an optimal threshold, limiting further gains from single-dimension fine-grained design. To address this bottleneck, we propose FineRMoE (FineR-Grained MoE), an architecture that extends fine-grained expert design to both the intermediate and output dimensions, aiming to enhance expert specialization beyond the single-dimension limit. We further introduce a bi-level sparse forward-computation paradigm and a specialized routing mechanism to govern expert activation. In addition, to obviate the prohibitive cost of training FineRMoE from scratch, we devise a generalized upcycling method that builds FineRMoE cost-effectively. Extensive experiments demonstrate the superior performance of FineRMoE across ten standard benchmarks. Compared with the strongest baseline, FineRMoE achieves 6 times higher parameter efficiency, 281 times lower prefill latency, and 136 times higher decoding throughput during inference.
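For context, the abstract builds on standard fine-grained MoE, where each expert's intermediate dimension is sliced into smaller experts and a router activates only the top-k of them per token. The sketch below illustrates that baseline scheme with toy NumPy shapes; it is not FineRMoE's bi-level mechanism (which is not specified here), and all sizes, weight names, and the ReLU expert FFN are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 8, 32          # toy hidden and FFN sizes (assumed, not from the paper)
granularity = 4                # slice each expert's intermediate dim into 4 pieces
n_experts = 2 * granularity    # fine-grained design: more, smaller experts
d_expert = d_ff // granularity # intermediate dim of each fine-grained expert
top_k = 2                      # experts activated per token

# Illustrative per-expert weights and router.
W_in = rng.normal(size=(n_experts, d_model, d_expert))
W_out = rng.normal(size=(n_experts, d_expert, d_model))
W_router = rng.normal(size=(d_model, n_experts))

def moe_forward(x):
    """Generic fine-grained top-k MoE forward pass for one token vector x."""
    logits = x @ W_router
    selected = np.argsort(logits)[-top_k:]       # indices of the k highest-scoring experts
    gates = np.exp(logits[selected])
    gates /= gates.sum()                         # softmax over the selected experts only
    y = np.zeros(d_model)
    for g, e in zip(gates, selected):
        h = np.maximum(x @ W_in[e], 0.0)         # expert FFN (ReLU assumed)
        y += g * (h @ W_out[e])                  # gate-weighted sum of expert outputs
    return y

token = rng.normal(size=d_model)
out = moe_forward(token)
print(out.shape)  # (8,)
```

FineRMoE's stated contribution is to extend this slicing from the intermediate dimension alone to the output dimension as well, with a second level of sparsity governing which slices are active.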
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Commonsense Reasoning | WinoGrande | -- | 1085 |
| Multitask Language Understanding | MMLU | Accuracy: 73.08 | 413 |
| Commonsense Reasoning | HellaSwag | Accuracy: 79.51 | 350 |
| Science Question Answering | ARC Challenge | Accuracy: 63.4 | 342 |
| Logical Reasoning | BBH | Accuracy: 71.22 | 201 |
| Graduate-level Question Answering | GPQA | Accuracy: 39.06 | 184 |
| Science Question Answering | ARC Easy | Accuracy: 86.95 | 155 |
| General Evaluation | AGIEval | Accuracy: 56.7 | 29 |
| Code Generation | MBPP | Performance Score: 69.4 | 28 |
| Aggregate General Language Modeling | Average of 10 Benchmarks | Average Score: 70.04 | 21 |