DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

About

Recent advances in multimodal language models (MLLMs) have made thinking with images a dominant paradigm for multimodal reasoning. However, existing methods still fail to ensure evidence-answer consistency, where correct answers must be supported by correct visual evidence. To address this issue, we propose DeFacto, a counterfactual reasoning framework that explicitly aligns visual evidence with final answers. Our approach integrates three complementary training paradigms: positive, counterfactual, and random-masking. We further develop a language-guided evidence construction pipeline that automatically localizes question-relevant regions and generates counterfactual variants, resulting in DeFacto-100K. Building on this dataset, we train MLLMs with GRPO-based reinforcement learning and design three complementary rewards to promote correct answering, structured reasoning, and consistent evidence selection. Moreover, we introduce DeFacto-1.5K, a human-annotated benchmark for systematically evaluating evidence-grounded consistency beyond answer accuracy. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and evidence-answer consistency over strong baselines.
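The abstract names three complementary rewards (correct answering, structured reasoning, consistent evidence selection) used with GRPO-based reinforcement learning, but gives no formulas. The following is a minimal sketch of how such a setup could look, not the authors' implementation. The function names, the boolean reward signals, and the equal weights are hypothetical; the only part taken from GRPO as generally described is that each sampled response's reward is normalized against the mean and standard deviation of its own group.

```python
from statistics import mean, pstdev


def combined_reward(answer_ok, format_ok, evidence_ok, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three reward signals (hypothetical form and weights).

    answer_ok:   the final answer matches the reference
    format_ok:   the response follows the expected reasoning structure
    evidence_ok: the selected visual evidence is consistent with the answer
    """
    w_ans, w_fmt, w_evi = weights
    return w_ans * float(answer_ok) + w_fmt * float(format_ok) + w_evi * float(evidence_ok)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Uses the population standard deviation (an assumption); eps avoids
    division by zero when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one question, scored on the three signals.
group = [
    combined_reward(True, True, True),
    combined_reward(False, True, False),
    combined_reward(True, True, False),
    combined_reward(True, False, True),
]
advantages = group_relative_advantages(group)
```

In this sketch a response that answers correctly but points at the wrong evidence scores lower than one that gets both right, so its advantage within the group is smaller. That is the kind of pressure toward evidence-answer consistency the abstract describes.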

Tianrun Xu, Haoda Jing, Ye Li, Yuquan Wei, Jun Feng, Guanyu Chen, Haichuan Gao, Tianren Zhang, Feng Chen • 2025

Related benchmarks

Task                                       Dataset    Result          Rank
Object Hallucination Evaluation            POPE       --              2019
Visual Question Answering                  VizWiz     Accuracy 61.4   1820
Text-based Visual Question Answering       TextVQA    Accuracy 82.9   962
Visual Question Answering                  ChartQA    Accuracy 82.1   519
Visual Question Answering                  ScienceQA  Accuracy 83.6   446
Optical Character Recognition              OCRBench   Score 871       433
Multi-discipline Multimodal Understanding  MMMU       Accuracy 56.6   363
Visual Question Answering                  VQA v2     Accuracy 72.1   333
Visual Question Answering                  GQA        Accuracy 63.9   155
Document Visual Question Answering         InfoVQA    Accuracy 0.791  85
Showing 10 of 21 rows
