
MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios

About

Mixture-of-Experts (MoE) models enable scalable performance but face severe memory constraints on edge devices. Existing offloading strategies struggle with I/O bottlenecks because autoregressive expert activation is dynamic and carries little advance information. In this paper, we propose to repurpose Speculative Decoding (SD) not merely as a compute accelerator but as an informative lookahead sensor for memory management, supported by our theoretical and empirical analyses. We introduce MoE-SpAc, an MoE inference framework that integrates a Speculative Utility Estimator to track expert demand, a Heterogeneous Workload Balancer to dynamically partition computation via online integer optimization, and an Asynchronous Execution Engine that unifies prefetching and eviction in the same utility space. Extensive experiments on seven benchmarks demonstrate that MoE-SpAc achieves a 42% improvement in TPS over the SOTA SD-based baseline and an average 4.04x speedup over all standard baselines. Code is available at https://github.com/lshAlgorithm/MoE-SpAc.
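The core idea above, using draft-token routing decisions as a lookahead signal for expert memory management, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the top-k probability-sum utility, and the single-tier cache model are all assumptions, and the actual Speculative Utility Estimator may weight draft tokens by acceptance likelihood and coordinate with the workload balancer.

```python
import heapq

def speculative_expert_utility(draft_routing_probs, top_k=2):
    """Aggregate per-expert utility over speculated (draft) tokens.

    draft_routing_probs: one dict per draft token, mapping expert_id ->
    routing probability (hypothetical format). Each token contributes the
    probability mass of its top-k routed experts, so experts demanded by
    many upcoming tokens accumulate high utility.
    """
    utility = {}
    for probs in draft_routing_probs:
        for expert, p in heapq.nlargest(top_k, probs.items(), key=lambda kv: kv[1]):
            utility[expert] = utility.get(expert, 0.0) + p
    return utility

def plan_transfers(utility, cached_experts, cache_capacity):
    """Rank prefetching and eviction in the same utility space.

    The target cache is the top-`cache_capacity` experts by utility; a
    missing expert is fetched only by displacing a lower-utility cached one.
    """
    ranked = sorted(utility, key=utility.get, reverse=True)
    target = set(ranked[:cache_capacity])
    prefetch = target - cached_experts
    evict = cached_experts - target
    return prefetch, evict
```

For example, if two draft tokens both route heavily to expert 0, it tops the utility ranking and is prefetched ahead of the verify pass, while stale cached experts drop out of the target set and are evicted asynchronously.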

Shuhuai Li, Jianghao Lin, Dongdong Ge, Yinyu Ye • 2026

Related benchmarks

| Task                   | Dataset   | Metric                 | Result | Rank |
|------------------------|-----------|------------------------|--------|------|
| Mathematical Reasoning | GSM8K     | -                      | -      | 246  |
| Instruction Following  | Alpaca    | -                      | -      | 111  |
| Question Answering     | QA        | -                      | -      | 47   |
| Text Summarization     | CNN/DM    | Throughput (tokens/s)  | 50.17  | 13   |
| Chat Evaluation        | MT-Bench  | Throughput (tokens/s)  | 24.1   | 10   |
| Code Generation        | HumanEval | Throughput (tokens/s)  | 28.24  | 10   |
| Language Understanding | MMLU-Pro  | Throughput (tokens/s)  | 25.8   | 10   |
| Code Generation        | HumanEval | Throughput (tokens/s)  | 47.26  | 6    |
| Instruction Following  | MT-Bench  | Throughput (tokens/s)  | 47.1   | 3    |
