Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models

About

Current vision language models suffer from hallucination and lack robustness when a modality is ambiguous or corrupted. We hypothesize that these issues can be addressed by exploiting the information shared between modalities to compensate for the impaired one. To this end, we analyze multimodal interactions, namely the redundant (shared), unique (exclusive), and synergistic (emergent) task-relevant information provided by the modalities, to determine their impact on model reliability. Specifically, amplifying redundant interactions would increase the exploitable shared information available to resolve these issues; yet modern instruction datasets often eliminate redundancies to prioritize visual grounding. We bridge this gap through a self-captioning workflow featuring a Multimodal Interaction Gate: a mechanism that converts unique interactions into redundant interactions. Our findings suggest that increasing redundancy can reduce visually induced errors by 38.3% and improve consistency by 16.8%.
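The interaction taxonomy and the gating idea described above can be illustrated with a toy sketch. This is not the paper's implementation: the `Sample` fields, the rule of labelling interactions by comparing unimodal and joint correctness, and the `gate` function that appends a self-generated caption are all assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # All fields are hypothetical stand-ins for per-sample model outcomes.
    image_only_correct: bool  # an image-only model answers correctly
    text_only_correct: bool   # a text-only model answers correctly
    joint_correct: bool       # a model given both modalities answers correctly
    text: str                 # the textual input (e.g. the question)
    caption: str = ""         # a self-generated caption of the image


def interaction_type(s: Sample) -> str:
    """Label a sample with a coarse interaction type (illustrative heuristic)."""
    if s.image_only_correct and s.text_only_correct:
        return "redundant"    # either modality alone suffices
    if s.image_only_correct or s.text_only_correct:
        return "unique"       # exactly one modality carries the answer
    if s.joint_correct:
        return "synergistic"  # the answer only emerges from both together
    return "none"


def gate(s: Sample) -> str:
    """Toy gate: if only the image carries the answer, copy that information
    into the text via the caption, so it becomes shared by both modalities."""
    if interaction_type(s) == "unique" and s.image_only_correct and s.caption:
        return f"{s.text}\nCaption: {s.caption}"
    return s.text
```

Under these assumptions, a sample whose answer is visible only in the image gains a caption in its text input, while redundant and synergistic samples pass through unchanged.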

Yuriel Ryan, Hei Man Ip, Adriel Kuek, Paul Pu Liang, Roy Ka-Wei Lee • 2026

Related benchmarks

Task                       Dataset         Result            Rank
Visual Question Answering  TextVQA         Accuracy 54.2     1453
Mathematical Reasoning     MathVista       Score 35          474
Multimodal Understanding   MMStar          --                407
Multimodal Understanding   MMMU            MMMU Score 49.9   232
Hallucination Diagnosis    HallusionBench  LI Score 96       15
