VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding

About

Despite the remarkable progress of recent efficient methods in accelerating multimodal understanding, they still suffer from noticeable performance degradation. Their emphasis on a high compression ratio for a single visual clue, together with their reliance on heuristic pruning strategies with coarse attention alignment, creates a bottleneck in the information capacity and density of visual tokens. To address this limitation, we propose VEN-VL, a visual ensemble MoE framework for effective and efficient perception that follows an enrich-then-compact principle. Specifically, we first enrich the information capacity by unifying visual representations from different perspectives, and then progressively compact them with adaptive routers in specialized visual experts to enhance the information density. Furthermore, we incorporate the reconstruction ability of the vanilla structure via explicit visual supervision, which facilitates the preservation of crucial information. Experimental results demonstrate superior performance on complex visual tasks with only a few information-condensed tokens, effectively bridging the gap between performance and efficiency.
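The enrich-then-compact idea can be illustrated with a toy sketch. This is not the VEN-VL implementation: the function names (`enrich`, `compact`, `enrich_then_compact`), the linear router weights, and the top-k selection rule are all assumptions made for illustration, standing in for the adaptive routers and specialized visual experts the abstract mentions. The reconstruction objective with explicit visual supervision is not modeled here.

```python
def enrich(views):
    """Concatenate per-token features from several visual encoders.

    `views` is a list of token lists, one per (hypothetical) encoder;
    every view must describe the same number of tokens.
    """
    n = len(views[0])
    assert all(len(v) == n for v in views), "views must align token by token"
    enriched = []
    for i in range(n):
        token = []
        for view in views:
            token.extend(view[i])  # widen the token with this view's features
        enriched.append(token)
    return enriched


def compact(tokens, router_weights, keep):
    """Score tokens with a linear router (an assumption) and keep the top `keep`.

    Surviving tokens are returned in their original order.
    """
    scores = [sum(w * x for w, x in zip(router_weights, t)) for t in tokens]
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]


def enrich_then_compact(views, routers, schedule):
    """Enrich once, then compact progressively, one router per stage."""
    tokens = enrich(views)
    for weights, keep in zip(routers, schedule):
        tokens = compact(tokens, weights, keep)
    return tokens


# Toy data: four tokens seen by two hypothetical encoders.
view_a = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
view_b = [[0.1], [0.9], [0.3], [0.7]]
routers = [[1.0, 1.0, 1.0], [0.0, 0.0, 1.0]]  # one weight vector per stage
result = enrich_then_compact([view_a, view_b], routers, schedule=[3, 2])
print(result)
```

With this toy input, four enriched tokens of width three are reduced to three and then to two. In a trained model the router weights would be learned rather than fixed by hand.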

Yinghao Wu, Zhuoyan Luo, Yiyao Yu, Zhaojian Yu, Yujiu Yang, Xiao-Ping Zhang • 2026

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy: 87.5	2056
Multimodal Understanding	MMBench	--	887
Diagram Question Answering	AI2D	--	509
Multimodal Understanding	MMBench CN	--	302
Multimodal Understanding	MMMU	MMMU Score: 39.7	110
Visual Question Answering	SEED-Bench Image	Accuracy: 72.5	80
Scientific Question Answering	SciQA	--	35
Text-based Visual Question Answering	TextVQA	ANLS: 0.737	33
Multimodal Evaluation	MME	Perception Score (P): 1.50e+3	18

