
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

About

As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet these models frequently fall short in interpreting cultural nuances. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, but its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 11,396 unique Wikipedia documents curated and ranked by human annotators. Through extensive evaluation of seven multimodal retrievers and fifteen VLMs, RAVENEA reveals several previously unreported findings: (i) In general, cultural grounding annotations can enhance multimodal retrieval and the corresponding downstream tasks. (ii) VLMs augmented with culture-aware retrieval generally outperform their non-augmented counterparts (by an average of +6% on cVQA and +11% on cIC). (iii) The performance of culture-aware retrieval augmentation varies widely across countries. These findings highlight the limitations of current multimodal retrievers and VLMs, underscoring the need to enhance visual culture understanding within RAG systems. We believe RAVENEA offers a valuable resource for advancing research on retrieval-augmented visual culture understanding.

Jiaang Li, Yifei Yuan, Wenyan Li, Mohammad Aliannejadi, Daniel Hershcovich, Anders Søgaard, Ivan Vulić, Wenxuan Zhang, Paul Pu Liang, Yang Deng, Serge Belongie • 2025

Related benchmarks

Task | Dataset | Result | Rank
Visual Question Answering | CVQA | Accuracy 86.1 | 12
Image Captioning | cIC | RegionScore 77.6 | 8
Multimodal Retrieval | RAVENEA (test) | MRR 82.17 | 7
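
The retrieval row reports MRR (mean reciprocal rank). This page does not say how that figure is computed, so the following is only a minimal sketch of the standard MRR definition, not the benchmark's own evaluation code. The function name, the input layout (one list of relevance flags per query, in ranked order), and the toy data are all assumptions made for illustration.

```python
from typing import Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[bool]]) -> float:
    """Standard MRR: average over queries of 1 / rank of the first relevant document.

    ranked_relevance[q][i] is True when the document at rank i + 1 for query q
    is relevant. A query with no relevant document contributes 0.
    (Hypothetical helper; the input layout is an assumption.)
    """
    if not ranked_relevance:
        raise ValueError("need at least one query")
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_relevance)


# Toy data, not taken from RAVENEA.
example = [
    [False, True, False],   # first relevant document at rank 2
    [True, False, False],   # first relevant document at rank 1
    [False, False, False],  # no relevant document retrieved
]
score = mean_reciprocal_rank(example)
```

Under this definition the value lies between 0 and 1; a figure such as 82.17 would correspond to the same quantity scaled by 100, assuming the table uses percentage scaling.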
