FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning
About
Multimodal large language models (MLLMs) have achieved impressive performance, but high-resolution visual inputs result in long sequences of visual tokens and substantial inference latency. Reducing redundant visual tokens is critical to easing computational and memory burdens while preserving performance, enabling MLLM deployment in resource-constrained or latency-sensitive scenarios. Current visual token pruning methods rely mainly on attention-based redundancy analysis and are tailored to dense architectures. We propose Fast Multimodal Mixture-of-Experts (FastMMoE), a training-free acceleration framework for mixture-of-experts (MoE) based MLLMs, developed from a routing-analysis perspective. FastMMoE combines two complementary strategies: (i) expert activation reduction for visual tokens, which minimizes unnecessary expert computation; and (ii) routing-aware token pruning, which leverages similarity between routing probability distributions to identify and remove highly redundant visual tokens. Experiments on large-scale MoE MLLMs such as DeepSeek-VL2 and InternVL3.5 demonstrate that FastMMoE reduces FLOPs by up to 55.0% while retaining approximately 95.5% of the original performance, consistently outperforming dense-model pruning baselines such as FastV and SparseVLM across multiple retention rates.
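The two strategies above can be illustrated with a minimal, self-contained sketch. This is not the FastMMoE implementation: the function names, the choice of top-k reduction for strategy (i), and the greedy cosine-similarity selection used for strategy (ii) are all illustrative assumptions; the paper's actual routing-analysis criteria may differ.

```python
import numpy as np

def routing_distributions(router_logits):
    """Softmax over experts -> per-token routing probability distribution."""
    z = router_logits - router_logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reduce_expert_activation(probs, top_k_visual=1):
    """Strategy (i), sketched: for visual tokens, activate only the
    top_k_visual highest-probability experts (assumed smaller than the
    model's default top-k), cutting expert computation for those tokens."""
    order = np.argsort(probs, axis=-1)[:, ::-1]  # experts, descending prob
    return order[:, :top_k_visual]               # expert ids to activate

def prune_by_routing_similarity(probs, keep_ratio=0.5):
    """Strategy (ii), sketched: greedily keep the visual tokens whose
    routing distributions are least similar (cosine) to any already-kept
    token; tokens with near-duplicate routing behavior are pruned."""
    n = probs.shape[0]
    n_keep = max(1, int(round(n * keep_ratio)))
    normed = probs / np.linalg.norm(probs, axis=-1, keepdims=True)
    sim = normed @ normed.T                      # pairwise cosine similarity
    kept = [0]                                   # seed with the first token
    while len(kept) < n_keep:
        max_sim = sim[:, kept].max(axis=1)       # closeness to the kept set
        max_sim[kept] = np.inf                   # never re-pick a kept token
        kept.append(int(np.argmin(max_sim)))     # most dissimilar candidate
    return sorted(kept)
```

For example, with four visual tokens whose router logits make tokens 0 and 1 route identically, pruning at a 50% keep ratio drops the duplicate token 1 and keeps two tokens with distinct routing distributions.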
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMBench | -- | -- | 637 |
| Science Question Answering | ScienceQA (test) | Average Accuracy | 77.56 | 245 |
| Optical Character Recognition | OCRBench | -- | -- | 232 |
| Multimodal Understanding | MMMU (val) | MMMU Score | 77.56 | 152 |
| Hallucination Evaluation | HallusionBench | Average Score | 77.56 | 108 |
| Diagram Understanding | AI2D | AI2D Score | 86.37 | 33 |
| Multimodal Reasoning | MMMU | MMMU Score | 51.56 | 12 |
| Multimodal Understanding | DeepSeek-VL2 Evaluation Suite | Average Score | 72.49 | 10 |
| Prefill Latency | MMMU | Prefill Latency (ms) | 1.80e+3 | 5 |