CASA: Cross-Attention over Self-Attention for Efficient Vision-Language Fusion

About

Vision-language models (VLMs) are commonly trained by directly inserting image tokens from a pretrained vision encoder into the text stream of a language model. This allows text and image information to fully attend to one another within the model, but quickly becomes costly, in both memory and compute, for long multi-image conversations or streaming video applications. VLMs leveraging cross-attention (CA) are an efficient alternative to token insertion, as image tokens are not added to the KV cache. Despite being introduced early on, multimodal CA models are scarce in the current VLM literature and often underperform their token-insertion counterparts. In this work, we reinvestigate the effectiveness of cross-attention for vision-language modeling: (i) we analyze the core differences between the cross-attention and self-attention mechanisms, (ii) we train cross-attention VLMs both from a text-only LLM and by adapting a pretrained insertion-based VLM, showing that simple cross-attention is far more competitive with token insertion than previously reported, and (iii) we demonstrate the practical advantages of cross-attention on real-time video captioning, where it naturally maintains low latency and near-constant memory cost. For samples and code, please see our project page at https://kyutai.org/casa.
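
To make the KV-cache argument concrete, here is a minimal sketch that counts cached tokens under the two fusion styles in a streaming setting. It is an illustration, not the authors' implementation: the function name `kv_cache_sizes`, the per-frame token counts, and the simplification that image tokens are dropped once the cross-attention layer has consumed them are all assumptions made for this example.

```python
def kv_cache_sizes(num_frames, image_tokens_per_frame, text_tokens_per_frame):
    """Return per-frame KV-cache lengths (in tokens) for both fusion styles.

    All token counts are hypothetical inputs, not values from the paper.
    """
    insertion, cross_attention = [], []
    ins_len = 0
    ca_len = 0
    for _ in range(num_frames):
        # Token insertion: image and text tokens both enter the
        # self-attention KV cache, so it grows with every frame.
        ins_len += image_tokens_per_frame + text_tokens_per_frame
        # Cross-attention (assumed simplification): only text tokens are
        # cached; image tokens serve as keys/values for the cross-attention
        # layer and are not added to the KV cache.
        ca_len += text_tokens_per_frame
        insertion.append(ins_len)
        cross_attention.append(ca_len)
    return insertion, cross_attention


ins, ca = kv_cache_sizes(num_frames=4, image_tokens_per_frame=256,
                         text_tokens_per_frame=8)
print(ins)  # [264, 528, 792, 1056]
print(ca)   # [8, 16, 24, 32]
```

Under these assumed counts, the insertion cache is dominated by image tokens, while the cross-attention cache grows only with the generated text, which is why its memory cost stays close to constant for short captions.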

Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez • 2025

Related benchmarks

Task | Dataset | Result | Rank
Text-based Visual Question Answering | TextVQA | Accuracy 77.4 | 962
Video Understanding | MVBench | -- | 563
Visual Question Answering | GQA | Accuracy 55 | 524
Visual Question Answering | AI2D | Accuracy 75.1 | 317
Diagram Understanding | AI2D | Accuracy 63.5 | 317
Visual Question Answering | RealworldQA | Accuracy 58.3 | 259
Long Video Understanding | MLVU | -- | 205
Chart Understanding | ChartQA | Accuracy 82.4 | 159
Infographic Question Answering | InfoVQA | ANLS 48.6 | 117
Visual Question Answering | TextVQA | Accuracy 71 | 79
Showing 10 of 22 rows
