
FineRMoE: Dimension Expansion for Finer-Grained Expert with Its Upcycling Approach

About

As revealed by the scaling law of fine-grained MoE, model performance ceases to improve once the granularity of the intermediate dimension exceeds the optimal threshold, limiting further gains from single-dimension fine-grained design. To address this bottleneck, we propose FineRMoE (FineR-Grained MoE), an architecture that extends fine-grained expert design to both the intermediate and output dimensions, aiming to enhance expert specialization beyond the single-dimension limit. We further introduce a bi-level sparse forward computation paradigm and a specialized routing mechanism to govern expert activation. In addition, to obviate the prohibitive cost of training FineRMoE from scratch, we devise a generalized upcycling method that builds FineRMoE cost-effectively. Extensive experiments demonstrate the superior performance of FineRMoE across ten standard benchmarks. Compared with the strongest baseline, FineRMoE achieves 6 times higher parameter efficiency, 281 times lower prefill latency, and 136 times higher decoding throughput during inference.
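To make the two-dimensional expert design concrete: the abstract describes experts sliced along both the intermediate and the output dimension, with a bi-level sparse forward pass governed by routing. The paper's actual implementation is not given here, so the sketch below is only a minimal illustration under assumptions; the class name, dimension choices, the use of two independent top-k routers (one per dimension), and the per-token loop are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLevelFineGrainedMoE(nn.Module):
    """Illustrative sketch, not the paper's FineRMoE: a dense FFN split into
    small experts along BOTH the intermediate and the output dimension, with
    one router per dimension selecting a sparse subset of each."""

    def __init__(self, d_model=512, d_ff=2048,
                 n_inter=8, k_inter=2,   # intermediate-dimension experts / active count
                 n_out=4, k_out=2):      # output-dimension slices / active count
        super().__init__()
        self.d_inter = d_ff // n_inter        # hidden-dimension slice per expert
        self.d_out = d_model // n_out         # output-dimension slice per expert
        self.k_inter, self.k_out = k_inter, k_out
        # up-projections: one small matrix per intermediate expert
        self.up = nn.ModuleList(nn.Linear(d_model, self.d_inter) for _ in range(n_inter))
        # down-projections: each intermediate expert owns n_out output slices
        self.down = nn.ModuleList(
            nn.ModuleList(nn.Linear(self.d_inter, self.d_out) for _ in range(n_out))
            for _ in range(n_inter))
        self.router_inter = nn.Linear(d_model, n_inter)  # level-1 router
        self.router_out = nn.Linear(d_model, n_out)      # level-2 router

    def forward(self, x):                     # x: (tokens, d_model)
        out = torch.zeros_like(x)
        # level 1: pick k_inter intermediate experts per token
        w1, idx1 = self.router_inter(x).softmax(-1).topk(self.k_inter, dim=-1)
        # level 2: pick k_out output slices per token
        w2, idx2 = self.router_out(x).softmax(-1).topk(self.k_out, dim=-1)
        for t in range(x.size(0)):            # per-token loop for clarity, not speed
            for a in range(self.k_inter):
                i = idx1[t, a].item()
                h = F.gelu(self.up[i](x[t]))  # small intermediate activation slice
                for b in range(self.k_out):
                    j = idx2[t, b].item()
                    s, e = j * self.d_out, (j + 1) * self.d_out
                    out[t, s:e] += w1[t, a] * w2[t, b] * self.down[i][j](h)
        return out

moe = BiLevelFineGrainedMoE()
y = moe(torch.randn(3, 512))
print(y.shape)  # torch.Size([3, 512])
```

The per-token loops are written for readability; a real implementation would batch tokens per expert and fuse the two routing levels into sparse kernels, which is presumably where the reported prefill-latency and decoding-throughput gains would have to come from.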
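On the upcycling side, the abstract says only that the method is generalized and cost-effective. A common baseline recipe, sparse upcycling, initializes every expert from a trained dense FFN rather than from scratch; the hypothetical helper below shows how slicing the dense weights could populate the two-dimensional expert grid sketched above. The function name and shapes are assumptions, not the paper's method.

```python
import torch

def upcycle_dense_ffn(w_up, w_down, n_inter, n_out):
    # Hypothetical sketch: slice a trained dense FFN into fine-grained experts.
    # w_up:   (d_ff, d_model)  dense up-projection weight
    # w_down: (d_model, d_ff)  dense down-projection weight
    ups = list(w_up.chunk(n_inter, dim=0))          # one slice per intermediate expert
    downs = [list(col.chunk(n_out, dim=0))          # each further split on the output dim
             for col in w_down.chunk(n_inter, dim=1)]
    return ups, downs

w_up, w_down = torch.randn(2048, 512), torch.randn(512, 2048)
ups, downs = upcycle_dense_ffn(w_up, w_down, n_inter=8, n_out=4)
print(ups[0].shape, downs[0][0].shape)  # torch.Size([256, 512]) torch.Size([128, 256])
```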

Ning Liao, Xiaoxing Wang, Xiaohan Qin, Junchi Yan • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | WinoGrande | -- | -- | 1085 |
| Multitask Language Understanding | MMLU | Accuracy | 73.08 | 413 |
| Commonsense Reasoning | HellaSwag | Accuracy | 79.51 | 350 |
| Science Question Answering | ARC Challenge | Accuracy | 63.4 | 342 |
| Logical Reasoning | BBH | Accuracy | 71.22 | 201 |
| Graduate-level Question Answering | GPQA | Accuracy | 39.06 | 184 |
| Science Question Answering | ARC Easy | Accuracy | 86.95 | 155 |
| General Evaluation | AGIEval | Accuracy | 56.7 | 29 |
| Code Generation | MBPP | Performance Score | 69.4 | 28 |
| Aggregate General Language Modeling | Average 10 Benchmarks | Average Score | 70.04 | 21 |

Showing 10 of 11 rows.

Other info

GitHub
