
Dynamic Adversarial Reinforcement Learning for Robust Multimodal Large Language Models

About

Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) exhibit perceptual fragility when confronted with visually complex scenes. This weakness stems from a reliance on finite training datasets, which are prohibitively expensive to scale and impose a ceiling on model robustness. We introduce AOT-SFT, a large-scale adversarial dataset for bootstrapping MLLM robustness. Building on this, we propose AOT (Adversarial Opponent Training), a self-play framework that forges MLLM robustness by creating its own training data. Our method orchestrates a co-evolution between an image-editing Attacker and a Defender MLLM, where the Attacker generates a diverse and dynamic curriculum of image manipulations, forcing the Defender to adapt and improve. Extensive experiments demonstrate that AOT enhances the Defender's perceptual robustness and reduces hallucinations, establishing a scalable paradigm for training more reliable MLLMs.
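The Attacker–Defender co-evolution described above can be sketched as a toy self-play loop. Everything here (the class names, the noise-based "image edit", the strength/tolerance update rules) is an illustrative assumption for exposition, not the paper's actual implementation:

```python
import random

class Attacker:
    """Proposes image manipulations; escalates strength when the Defender copes."""
    def __init__(self):
        self.strength = 1  # current perturbation magnitude (curriculum difficulty)

    def perturb(self, image):
        # Stand-in for an image edit: add noise proportional to current strength.
        return [px + random.uniform(-self.strength, self.strength) for px in image]

    def update(self, defender_was_correct):
        # Dynamic curriculum: harder edits when the Defender succeeds, easier when it fails.
        self.strength = self.strength + 1 if defender_was_correct else max(1, self.strength - 1)

class Defender:
    """Answers a query about the (possibly perturbed) image; improves on failures."""
    def __init__(self):
        self.tolerance = 2.0  # stand-in for the model's robustness level

    def answer(self, clean, perturbed):
        # Toy robustness check: correct if the perturbation stays within tolerance.
        return all(abs(a - b) <= self.tolerance for a, b in zip(clean, perturbed))

    def update(self, was_correct):
        if not was_correct:
            self.tolerance += 0.5  # stand-in for a gradient update on the failure case

def self_play(rounds=50, seed=0):
    """Run the adversarial loop; both agents adapt to each other every round."""
    random.seed(seed)
    attacker, defender = Attacker(), Defender()
    wins = 0
    for _ in range(rounds):
        clean = [0.0] * 8                      # placeholder "image"
        adversarial = attacker.perturb(clean)  # Attacker creates the training sample
        correct = defender.answer(clean, adversarial)
        attacker.update(correct)               # co-evolution: both sides update
        defender.update(correct)
        wins += correct
    return wins, attacker.strength, defender.tolerance
```

The point of the sketch is the feedback structure: the Attacker's curriculum difficulty tracks the Defender's current competence, so training data is generated on-policy rather than drawn from a fixed dataset.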

Yicheng Bao, Xuhong Wang, Qiaosheng Zhang, Chaochao Lu, Xia Hu, Xin Tan • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | RealworldQA | Accuracy | 70.07 | 179 |
| Hallucination Evaluation | POPE | -- | -- | 153 |
| Multimodal Understanding | MMMU (val) | -- | -- | 152 |
| Multimodal Understanding | SEEDBench2 Plus | Accuracy | 70.05 | 74 |
| Hallucination Evaluation | HallusionBench | Answer Accuracy (aAcc) | 69.19 | 39 |
| Multi-modal Visual Capability | MMStar | Score | 61.53 | 29 |
| Multi-image Visual Perception | BLINK | Accuracy | 55.92 | 26 |
| High-Resolution Multimodal Understanding | HRBench-8K | Accuracy | 71.5 | 13 |
| Multidisciplinary Knowledge and Reasoning | MMMU (dev) | Score | 25.33 | 9 |
| Perceptual Robustness | VSTAR | Overall Accuracy | 80.25 | 9 |

Showing 10 of 15 rows.
