On the robustness of multimodal language model towards distractions

About

Although vision-language models (VLMs) have achieved significant success in applications such as visual question answering, their resilience to prompt variations remains under-explored. Understanding how distractions affect VLMs is crucial for improving their real-world applicability, since inputs in many practical scenarios contain noisy and irrelevant information. This paper assesses the robustness of VLMs against both visual and textual distractions in the context of science question answering. Building on the ScienceQA dataset, we develop a new benchmark that introduces distractions in both the visual and textual contexts to evaluate the reasoning capacity of VLMs amid these distractions. Our findings reveal that most state-of-the-art VLMs, including GPT-4, are vulnerable to various types of distractions and show a noticeable degradation in reasoning capability when confronted with them. Notably, models such as InternVL2 demonstrate a higher degree of robustness. We also find that models are more sensitive to textual distractions than to visual ones. Additionally, we explore mitigation strategies, such as prompt engineering, to counteract the impact of distractions. While these strategies improve solution accuracy, our analysis shows that significant room for improvement remains.

Ming Liu, Hao Chen, Jindong Wang, Wensheng Zhang • 2025
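
The evaluation idea summarized in the abstract (add a distraction to a question, optionally add a mitigation instruction, then compare accuracy with and without the distraction) can be illustrated with a minimal Python sketch. This is not the authors' benchmark code: the function names, the distractor sentence, and the wording of the mitigation instruction are hypothetical, and the model call itself is left out, so predictions are assumed to be plain lists of answer labels.

```python
import random


def add_textual_distraction(question, distractors, rng):
    """Prepend one irrelevant sentence to the question text.

    `distractors` is a hypothetical pool of off-topic sentences;
    the actual benchmark may build its distractions differently.
    """
    return f"{rng.choice(distractors)} {question}"


def add_mitigation_prompt(prompt):
    """Prefix a simple instruction asking the model to ignore noise.

    The instruction wording is an assumption made for illustration.
    """
    return "Ignore any information that is irrelevant to the question. " + prompt


def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the gold answers."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def accuracy_drop(clean_preds, distracted_preds, answers):
    """Accuracy on clean inputs minus accuracy on distracted inputs."""
    return accuracy(clean_preds, answers) - accuracy(distracted_preds, answers)


if __name__ == "__main__":
    rng = random.Random(0)
    question = "Which object is attracted to a magnet?"
    distracted = add_textual_distraction(
        question, ["The classroom windows face north."], rng
    )
    print(add_mitigation_prompt(distracted))
```

A positive value from `accuracy_drop` means the distraction hurt the model; comparing that value with and without `add_mitigation_prompt` gives a rough sense of how much a prompt-level mitigation recovers.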

Related benchmarks

Task	Dataset	Result
Hallucination Evaluation	POPE	--	217
Hallucination assessment	HallusionBench	Answer Accuracy (aAcc) 68.45	39
Multi-modal Visual Capability	MMStar	Score 61.2	29
Multi-image visual perception	BLINK	Accuracy 54.28	26
Perceptual Robustness	VSTAR	Overall Accuracy 76.05	9
Perceptual Robustness	HRBench 4K	Overall Score 67.12	9
Perceptual Robustness	HRBench-8K	Overall Score 66.75	9
Real-world Understanding	RealworldQA	Score 68.76	9
Multidisciplinary knowledge and reasoning	MMMU (dev)	Score 20.67	9
