DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

About

Recent advances in multimodal language models (MLLMs) have made thinking with images a dominant paradigm for multimodal reasoning. However, existing methods still fail to ensure evidence-answer consistency, where correct answers must be supported by correct visual evidence. To address this issue, we propose DeFacto, a counterfactual reasoning framework that explicitly aligns visual evidence with final answers. Our approach integrates three complementary training paradigms: positive, counterfactual, and random-masking. We further develop a language-guided evidence construction pipeline that automatically localizes question-relevant regions and generates counterfactual variants, resulting in DeFacto-100K. Building on this dataset, we train MLLMs with GRPO-based reinforcement learning and design three complementary rewards to promote correct answering, structured reasoning, and consistent evidence selection. Moreover, we introduce DeFacto-1.5K, a human-annotated benchmark for systematically evaluating evidence-grounded consistency beyond answer accuracy. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and evidence-answer consistency over strong baselines.

Tianrun Xu, Haoda Jing, Ye Li, Yuquan Wei, Jun Feng, Guanyu Chen, Haichuan Gao, Tianren Zhang, Feng Chen • 2025
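
The abstract mentions GRPO-based reinforcement learning with three rewards (correct answering, structured reasoning, consistent evidence selection) but gives no formulas. The sketch below is only an illustration of the general idea, not the authors' implementation: it assumes each reward is a binary check, sums them with hypothetical weights, and then applies the group-relative normalization that GRPO uses to turn the rewards of several sampled responses into advantages. The function names, the weights, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def combined_reward(answer_ok, format_ok, evidence_ok, weights=(1.0, 1.0, 1.0)):
    """Hypothetical scalar reward: weighted sum of three binary checks.

    The three checks mirror the reward types named in the abstract;
    the binary form and the weights are assumptions for illustration.
    """
    w_ans, w_fmt, w_evi = weights
    return (w_ans * float(answer_ok)
            + w_fmt * float(format_ok)
            + w_evi * float(evidence_ok))


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward within its group.

    Each response sampled for the same prompt is scored, then centered
    on the group mean and scaled by the group standard deviation
    (population std here, which is an assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses sampled for one prompt:
# (answer correct, format valid, evidence consistent)
group = [
    (True, True, True),
    (True, True, False),
    (False, True, False),
    (False, False, False),
]
rewards = [combined_reward(*flags) for flags in group]
advantages = group_relative_advantages(rewards)
```

With this setup, a response that answers correctly but selects inconsistent evidence receives a lower advantage than one that satisfies all three checks, which is the kind of pressure toward evidence-answer consistency the abstract describes.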

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2056
Visual Question Answering	VizWiz	Accuracy 61.4	1863
Text-based Visual Question Answering	TextVQA	Accuracy 82.9	984
Visual Question Answering	ChartQA	Accuracy 82.1	620
Visual Question Answering	ScienceQA	Accuracy 83.6	525
Optical Character Recognition	OCRBench	Score 871	486
Multi-discipline Multimodal Understanding	MMMU	Accuracy 56.6	422
Visual Question Answering	VQA v2	Accuracy 72.1	347
Visual Question Answering	GQA	Accuracy 63.9	218
Multimodal Reasoning	MMStar	Accuracy 63.2	102

Showing 10 of 21 rows
