
CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion

About

Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at https://kyutai.org/casa.
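To make the core idea concrete, below is a minimal, hedged sketch of what "cross-attention via self-attention" could look like: text queries attend over a concatenation of image tokens and text tokens, so a single attention call yields both cross-attention (text to image) and local text-to-text interaction. This is not the authors' implementation; the module name `CASABlock`, the parameters `d_model`, `n_heads`, and `local_window`, and the causal local-window masking are all illustrative assumptions.

```python
# Illustrative sketch (assumed, not the official CASA code): cross-attention
# realized as self-attention over [image tokens ; text tokens], with text-to-text
# attention restricted to a local causal window.

import torch
import torch.nn as nn


class CASABlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, local_window: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.local_window = local_window

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text:  (batch, T, d_model) token embeddings from the language model
        # image: (batch, I, d_model) tokens from the pretrained vision encoder
        I, T = image.shape[1], text.shape[1]

        # Keys/values are the image tokens followed by the text tokens, so one
        # attention call covers text->image and text->text interaction.
        kv = torch.cat([image, text], dim=1)  # (batch, I + T, d_model)

        # Boolean mask of shape (T, I + T): True means "do not attend".
        # Text queries see all image tokens; text->text attention is limited
        # to a causal local window (an assumption about how locality is enforced).
        mask = torch.ones(T, I + T, dtype=torch.bool, device=text.device)
        mask[:, :I] = False  # attend freely to image tokens
        for q in range(T):
            lo = max(0, q - self.local_window + 1)
            mask[q, I + lo : I + q + 1] = False  # local causal text window

        out, _ = self.attn(query=text, key=kv, value=kv, attn_mask=mask)
        return out


if __name__ == "__main__":
    block = CASABlock()
    text = torch.randn(2, 16, 512)
    image = torch.randn(2, 256, 512)
    print(block(text, image).shape)  # torch.Size([2, 16, 512])
```

Because the image tokens appear only as keys/values in these dedicated layers, the memory and compute footprint scales like a cross-attention model rather than full token insertion, which is the scalability property the abstract highlights.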

Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-based Visual Question Answering | TextVQA | Accuracy | 77.4 | 496 |
| Visual Question Answering | GQA | Accuracy | 55 | 374 |
| Video Understanding | MVBench | -- | -- | 247 |
| Visual Question Answering | AI2D | Accuracy | 75.1 | 174 |
| Diagram Understanding | AI2D | Accuracy | 63.5 | 167 |
| Visual Question Answering | RealworldQA | Accuracy | 58.3 | 98 |
| Chart Understanding | ChartQA | Accuracy | 82.4 | 83 |
| Visual Question Answering | TextVQA | Accuracy | 71 | 79 |
| Long Video Understanding | MLVU | -- | -- | 72 |
| Multimodal Model Evaluation | MME | Total Score | 1620 | 63 |
Showing 10 of 22 rows

Other info

GitHub
