The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation

About

While Multimodal Large Language Models (MLLMs) are increasingly integrated with Retrieval-Augmented Generation (RAG) to mitigate hallucinations, the introduction of external documents can conceal severe failure modes at the instance level. We identify and formalize the phenomenon of recorruption, where the introduction of even perfectly accurate "oracle" context causes a capable model to abandon an initially correct prediction. Through a mechanistic diagnosis of internal attention matrices, we show that recorruption is driven by a two-fold attentional collapse: (1) visual blindness, characterized by the systemic suppression of visual attention mass ($M_{vis}$) and sharpness ($S_{vis}$), and (2) a structural positional bias that forces the model to prioritize boundary tokens over semantic relevance. Our analysis reveals an Illusion of Success, demonstrating that many seemingly correct RAG outcomes are merely positional coincidences where the model's textual copying bias happens to align with the ground-truth location. To address these vulnerabilities, we propose Bottleneck Attention Intervention for Recovery (BAIR), a parameter-free, inference-time framework that restores visual saliency and applies position-aware penalties to textual distractors. Across medical factuality, social fairness, and geospatial benchmarks, BAIR successfully restores multimodal grounding and improves diagnostic reliability without requiring model retraining or fine-tuning.
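The definition of recorruption above (a prediction that is correct without retrieved context and becomes incorrect once accurate oracle context is added) can be expressed as a per-instance check. The following is a minimal sketch, not the authors' evaluation code: the `Instance` record, its field names, and the choice to normalize the rate by initially correct instances are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Instance:
    """Hypothetical record pairing two predictions for one example."""
    label: str             # ground-truth answer
    pred_no_context: str   # model prediction without retrieved context
    pred_with_oracle: str  # model prediction with accurate oracle context


def is_recorrupted(x: Instance) -> bool:
    # Recorruption: correct without context, wrong once oracle context is added.
    return x.pred_no_context == x.label and x.pred_with_oracle != x.label


def recorruption_rate(instances: Sequence[Instance]) -> float:
    # Assumption: the rate is taken over initially correct instances only,
    # since only those predictions can be abandoned.
    initially_correct = [x for x in instances if x.pred_no_context == x.label]
    if not initially_correct:
        return 0.0
    flipped = sum(is_recorrupted(x) for x in initially_correct)
    return flipped / len(initially_correct)


if __name__ == "__main__":
    data = [
        Instance("airport", "airport", "airport"),  # stays correct
        Instance("airport", "airport", "harbor"),   # recorrupted
        Instance("forest", "meadow", "forest"),     # fixed by context
        Instance("bridge", "bridge", "bridge"),     # stays correct
    ]
    print(recorruption_rate(data))
```

Aggregate accuracy alone would hide the second instance if another instance were fixed by the same context, which is why the check is done per instance.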

Hoin Jung, Xiaoqian Wang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Scene Classification | NWPU RESISC45 (test) | Top-1 Accuracy | 96.74 | 28 |
| Medical factuality evaluation | IU-Chest X-ray (test) | Accuracy | 68.31 | 22 |
| Multimodal Retrieval-Augmented Generation | NWPU (test) | Accuracy | 96.74 | 22 |
| Multimodal Retrieval-Augmented Generation | FACET | Accuracy | 92.58 | 22 |
| Multimodal Retrieval-Augmented Generation | IU-Chest | Accuracy | 68.31 | 22 |
| Social fairness evaluation | FACET (test) | Accuracy | 92.58 | 22 |