
VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading

About

Large-scale vision-language mixture-of-experts (VL-MoE) models provide strong multimodal capability, but efficient deployment on memory-constrained platforms remains difficult. Existing MoE offloading systems are largely designed for text-centric workloads and become much less effective for visual-heavy inputs, where large numbers of visual tokens induce broader and less predictable expert accesses. We present VisMMoE, a VL-MoE offloading system built on a single systems insight: pruning redundant visual tokens can improve offloading not only by reducing computation, but also by reshaping expert demand. We refer to this effect as visual-expert affinity: token pruning makes expert accesses more concentrated within layers and more stable across layers, producing a smaller and more predictable expert working set. Guided by this insight, VisMMoE combines affinity-aware token compression, lookahead expert prediction, and cache/pipeline orchestration to improve expert locality and prefetch effectiveness under tight memory budgets. We implement VisMMoE on multiple frameworks and evaluate it on representative VL-MoE models and benchmarks. VisMMoE improves end-to-end inference performance by up to 2.68x and 1.61x, respectively, over strong baselines for today's VL-MoE deployments while maintaining competitive accuracy.
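
The notion of an "expert working set" can be made concrete with a small sketch. The code below is not the VisMMoE implementation: the router scores are random, the stride-based pruning is a stand-in for the paper's affinity-aware compression, and every name (`top_k_experts`, `expert_working_set`, the toy sizes) is hypothetical. It only shows the one property that holds for any pruning rule, namely that the experts activated by the kept tokens are a subset of those activated by all tokens; the stronger concentration and cross-layer stability claimed in the abstract depend on the real router and pruning criterion.

```python
import random


def top_k_experts(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]


def expert_working_set(token_scores, k):
    """Union of the experts activated by any token in one MoE layer."""
    active = set()
    for scores in token_scores:
        active.update(top_k_experts(scores, k))
    return active


# Hypothetical toy setup: 64 visual tokens, 16 experts, top-2 routing.
rng = random.Random(0)
num_tokens, num_experts, k = 64, 16, 2
token_scores = [
    [rng.random() for _ in range(num_experts)] for _ in range(num_tokens)
]

# Stand-in for token pruning: keep every fourth token.
# (Assumption for illustration only; not the paper's pruning criterion.)
kept_scores = token_scores[::4]

full_set = expert_working_set(token_scores, k)
pruned_set = expert_working_set(kept_scores, k)

print("experts touched, all tokens:   ", len(full_set))
print("experts touched, pruned tokens:", len(pruned_set))
```

In an offloading setting, the size of this set bounds how many expert weights must be resident or prefetched for the layer, which is why a smaller and more predictable set matters under a tight memory budget.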

Cheng Xu, Xiaofeng Hou, Jiacheng Liu, Chao Li • 2026

Related benchmarks

Task                             Dataset    Result              Rank
Object Hallucination Evaluation  POPE       --                  2019
Multimodal Understanding         MMBench    Accuracy 84.6       847
Optical Character Recognition    OCRBench   Score 801           433
Vision-Language Evaluation       MME        MME Score 2.47e+3   36
