
CASA: Cross-Attention over Self-Attention for Efficient Vision-Language Fusion

About

Vision-language models (VLMs) are commonly trained by directly inserting image tokens from a pretrained vision encoder into the text stream of a language model. This allows text and image information to fully attend to one another within the model, but rapidly becomes costly for long multi-image conversations or streaming video applications, in both memory and compute. VLMs leveraging cross-attention (CA) are an efficient alternative to token insertion, as image tokens are not added to the KV cache. Despite being introduced early on, multimodal CA models are scarce in the current VLM literature and often underperform their token-insertion counterparts. In this work, we reinvestigate the effectiveness of cross-attention for vision-language modeling: (i) we analyze the core differences between the cross-attention and self-attention mechanisms, (ii) we train cross-attention VLMs both from a text-only LLM and by adapting a pretrained insertion-based VLM, showing that simple cross-attention is far more competitive with token insertion than previously reported, and (iii) we demonstrate the practical advantages of cross-attention on real-time video captioning, where it naturally maintains low latency and near-constant memory cost. For samples and code, please see our project page at https://kyutai.org/casa.
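The memory argument in the abstract can be made concrete with a rough KV-cache calculation. The sketch below is not from the paper; all numbers (layer count, head dimensions, tokens per frame) are illustrative assumptions chosen only to show why insertion-based caches grow with every frame while a cross-attention cache grows only with the text stream.

```python
# Back-of-the-envelope KV-cache comparison for a streaming video setting.
# All model sizes and token counts below are assumed for illustration;
# they are not the paper's configuration.

def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes to cache keys and values for `num_tokens` tokens.

    Two tensors (K and V) per layer, each of shape
    [num_tokens, num_kv_heads, head_dim], stored at 2 bytes (fp16/bf16).
    """
    return 2 * num_layers * num_tokens * num_kv_heads * head_dim * bytes_per_value

frames = 1000            # frames seen so far in the stream (assumed)
tokens_per_frame = 64    # image tokens per frame (assumed)
text_tokens = 2000       # caption tokens produced so far (assumed)

# Token insertion: image tokens enter the LM's self-attention KV cache
# alongside the text, so the cache grows with every new frame.
insertion = kv_cache_bytes(text_tokens + frames * tokens_per_frame)

# Cross-attention: only text tokens live in the self-attention KV cache;
# image features are consumed by separate cross-attention layers and can
# be bounded (e.g. to the most recent frames), keeping memory near-constant.
cross_attn = kv_cache_bytes(text_tokens)

print(f"insertion KV cache:       {insertion / 1e9:.2f} GB")
print(f"cross-attention KV cache: {cross_attn / 1e9:.2f} GB")
```

Under these assumed numbers the insertion-based cache is roughly 30x larger after 1000 frames, and the gap keeps widening as the stream continues, which is the practical motivation for cross-attention in real-time captioning.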

Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez • 2025

Related benchmarks

Task                                   Dataset      Metric    Result  Rank
Text-based Visual Question Answering   TextVQA      Accuracy  77.4    807
Visual Question Answering              GQA          Accuracy  55      505
Video Understanding                    MVBench      --        --      425
Visual Question Answering              AI2D         Accuracy  75.1    249
Diagram Understanding                  AI2D         Accuracy  63.5    247
Visual Question Answering              RealworldQA  Accuracy  58.3    179
Long Video Understanding               MLVU         Score     65.1    154
Chart Understanding                    ChartQA      Accuracy  82.4    127
Infographic Question Answering         InfoVQA      ANLS      48.6    90
Visual Question Answering              TextVQA      Accuracy  71      79
Showing 10 of 22 rows

Other info

GitHub
