
ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention

About

Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving superior performance-efficiency trade-offs. Moreover, ViCA provides a regular, hardware-friendly inference pipeline that yields over 3.5x speedup in single-batch inference and over 10x speedup in multi-batch inference, reducing visual grounding to near-zero overhead compared with text-only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at https://github.com/EIT-NLP/ViCA.
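The core idea — visual tokens skip every self-attention and feed-forward layer and are consumed only through cross-attention at a few selected layers — can be sketched in a toy forward pass. This is an illustrative sketch only, not the authors' implementation: the layer count, the set of cross-attention layers, the dimensions, and the toy feed-forward are arbitrary assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no projections, for clarity).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def vica_layer(text, visual, use_cross_attention):
    # Self-attention runs over text tokens only; visual tokens never enter it.
    text = text + attention(text, text, text)
    # Sparse cross-attention: only at selected layers do text queries attend
    # to the (static) projected visual embeddings.
    if use_cross_attention:
        text = text + attention(text, visual, visual)
    # Feed-forward applied to text tokens only (toy nonlinearity here).
    return text + np.tanh(text)

# Toy forward pass: 4 layers, cross-attention only at layers {1, 3}.
rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(5, d))     # 5 text tokens
visual = rng.normal(size=(16, d))  # 16 projected visual tokens (fixed)
cross_layers = {1, 3}              # hypothetical selected layers
for layer in range(4):
    text = vica_layer(text, visual, layer in cross_layers)
print(text.shape)  # (5, 8)
```

Because `visual` is never updated by any layer, the visual-side cost is a single projection plus the few cross-attention reads, which is where the claimed near-zero visual overhead comes from.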

Wenjie Liu, Hao Wu, Xin Qiu, Yingqi Fan, Yihan Zhang, Anhao Zhao, Yunpu Ma, Xiaoyu Shen • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 76.6 | 1165 |
| Object Hallucination Evaluation | POPE | Accuracy | 86.7 | 935 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 55.5 | 496 |
| Visual Question Answering | GQA | Accuracy | 60.4 | 374 |
| Multimodal Understanding | MMBench | -- | -- | 367 |
| Multimodal Understanding | MMBench CN | Accuracy | 57.7 | 162 |
| Science Question Answering | ScienceQA (SQA-IMG) | Accuracy | 69.3 | 114 |
| Multimodal Understanding | MMBench (MMB) | Accuracy | 64.0 | 69 |
| Multimodal Perception | MME Perception | Perception Score | 1460 | 61 |
| Multimodal Understanding | SEED-I (Image) | Accuracy | 0.632 | 40 |
Showing 10 of 12 rows.
