ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

About

Fine-grained Mixture-of-Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while maintaining high model capacity. However, in memory-constrained inference scenarios, only a small set of experts can be cached. Experts not in the cache must be fetched from slow external storage (e.g., UFS), leading to frequent evictions and substantial I/O overhead. We propose ReMoE, a router fine-tuning framework designed to boost token-wise expert reuse. ReMoE biases the router toward recently selected experts, producing temporally stable routing that better matches cache locality constraints. By increasing short-horizon expert reuse, ReMoE reduces expert fetches from storage without adding inference-time computation. Experiments on DeepSeek and Qwen models show that ReMoE improves expert reuse by 26% while maintaining downstream task performance. Real-system evaluations further confirm these benefits, improving output throughput by 8.4% under vLLM GPU-CPU expert offloading and reducing TPOT by 43.6-49.8% under llama.cpp on Jetson Orin NX, corresponding to a 1.77-1.99× decode speedup across diverse workloads. Checkpoints and usage instructions are available at https://github.com/BUAA-OSCAR/ReMoE.
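To make the link between token-wise expert reuse and storage fetches concrete, here is a minimal Python sketch. It is not the authors' method or code. It assumes an LRU eviction policy (the abstract does not name one), and it imitates "temporally stable routing" with a hypothetical `stickiness` parameter in a synthetic trace generator instead of a fine-tuned router. All function names are illustrative. The last helper checks the abstract's arithmetic: since decode speedup is the ratio of old to new TPOT, a fractional TPOT reduction r gives a speedup of 1 / (1 - r).

```python
from collections import OrderedDict
import random


def count_expert_fetches(trace, cache_size):
    """Count cache misses (storage fetches) for a routing trace under LRU.

    trace: list of per-token expert-id lists. LRU is an assumed policy.
    """
    cache = OrderedDict()
    fetches = 0
    for experts in trace:
        for e in experts:
            if e in cache:
                cache.move_to_end(e)           # hit: mark as most recently used
            else:
                fetches += 1                   # miss: expert must be fetched
                cache[e] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used
    return fetches


def token_reuse_rate(trace):
    """Fraction of selected experts that the previous token also selected."""
    reused = total = 0
    for prev, cur in zip(trace, trace[1:]):
        reused += len(set(prev) & set(cur))
        total += len(cur)
    return reused / total if total else 0.0


def make_trace(num_tokens, num_experts, top_k, stickiness, seed=0):
    """Synthetic routing trace (hypothetical stand-in for a real router).

    Each expert chosen for the previous token is kept with probability
    `stickiness`; the remaining slots are filled uniformly at random.
    """
    rng = random.Random(seed)
    trace = [rng.sample(range(num_experts), top_k)]
    for _ in range(num_tokens - 1):
        cur = [e for e in trace[-1] if rng.random() < stickiness]
        pool = [e for e in range(num_experts) if e not in cur]
        cur += rng.sample(pool, top_k - len(cur))
        trace.append(cur)
    return trace


def speedup_from_tpot_reduction(reduction):
    """Decode speedup implied by a fractional reduction in TPOT."""
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    for s in (0.0, 0.9):
        t = make_trace(num_tokens=200, num_experts=64, top_k=6, stickiness=s)
        print(s, round(token_reuse_rate(t), 3), count_expert_fetches(t, cache_size=8))
    # 43.6% and 49.8% TPOT reductions map to the reported speedup range.
    print(round(speedup_from_tpot_reduction(0.436), 2),
          round(speedup_from_tpot_reduction(0.498), 2))  # 1.77 1.99
```

In this toy setting, a stickier trace has a higher reuse rate and needs fewer fetches at the same cache size, which is the effect the abstract attributes to router fine-tuning. The sizes used here (64 experts, top-6 routing, an 8-expert cache) are arbitrary illustration values, not configurations from the paper.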

Xiongwei Zhu, Xiaojian Liao, Tianyang Jiang, Yusen Zhang, Liang Wang, Limin Xiao • 2026

Related benchmarks

Task	Dataset	Result	
Multi-task Language Understanding	MMLU	MMLU Accuracy: 61.2	456
Multitask Language Understanding	MMLU	--	263
LLM Inference Acceleration	GSM8K	Speedup: 1.77	61
Online Inference	ShareGPT	--	32
Mathematical Reasoning	GSM8K	EM (Strict): 38.13	3
LLM Inference Efficiency	HumanEval	TTFT (ms): 5.23e+3	2
Mathematical Reasoning	GSM8K	EM (strict): 18.14	2
Instruction Following	IFEval	Instruction Following (Loose): 17.93	2
