
CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion

About

Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at https://kyutai.org/casa.
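To make the core idea concrete, below is a minimal, hedged sketch of what "cross-attention via self-attention" could look like: text queries attend over a concatenation of image tokens and text tokens, so a single attention call yields both cross-attention (text to image) and local text-to-text interaction. This is not the authors' implementation; the module name `CASABlock`, the parameters `d_model`, `n_heads`, and `local_window`, and the causal local-window masking are all illustrative assumptions.

```python
# Illustrative sketch (assumed, not the official CASA code): cross-attention
# realized as self-attention over [image tokens ; text tokens], with text-to-text
# attention restricted to a local causal window.

import torch
import torch.nn as nn


class CASABlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, local_window: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.local_window = local_window

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text:  (batch, T, d_model) token embeddings from the language model
        # image: (batch, I, d_model) tokens from the pretrained vision encoder
        I, T = image.shape[1], text.shape[1]

        # Keys/values are the image tokens followed by the text tokens, so one
        # attention call covers text->image and text->text interaction.
        kv = torch.cat([image, text], dim=1)  # (batch, I + T, d_model)

        # Boolean mask of shape (T, I + T): True means "do not attend".
        # Text queries see all image tokens; text->text attention is limited
        # to a causal local window (an assumption about how locality is enforced).
        mask = torch.ones(T, I + T, dtype=torch.bool, device=text.device)
        mask[:, :I] = False  # attend freely to image tokens
        for q in range(T):
            lo = max(0, q - self.local_window + 1)
            mask[q, I + lo : I + q + 1] = False  # local causal text window

        out, _ = self.attn(query=text, key=kv, value=kv, attn_mask=mask)
        return out


if __name__ == "__main__":
    block = CASABlock()
    text = torch.randn(2, 16, 512)
    image = torch.randn(2, 256, 512)
    print(block(text, image).shape)  # torch.Size([2, 16, 512])
```

Because the image tokens appear only as keys/values in these dedicated layers, the memory and compute footprint scales like a cross-attention model rather than full token insertion, which is the scalability property the abstract highlights.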

Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-based Visual Question Answering | TextVQA | Accuracy | 77.4 | 496 |
| Visual Question Answering | GQA | Accuracy | 55 | 374 |
| Video Understanding | MVBench | -- | -- | 247 |
| Visual Question Answering | AI2D | Accuracy | 75.1 | 174 |
| Diagram Understanding | AI2D | Accuracy | 63.5 | 167 |
| Visual Question Answering | RealworldQA | Accuracy | 58.3 | 98 |
| Chart Understanding | ChartQA | Accuracy | 82.4 | 83 |
| Visual Question Answering | TextVQA | Accuracy | 71 | 79 |
| Long Video Understanding | MLVU | -- | -- | 72 |
| Multimodal Model Evaluation | MME | Total Score | 1620 | 63 |
Showing 10 of 22 rows

Other info

GitHub
