DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
About
Recent advances in multimodal language models (MLLMs) have made thinking with images a dominant paradigm for multimodal reasoning. However, existing methods still fail to ensure evidence-answer consistency, where correct answers must be supported by correct visual evidence. To address this issue, we propose DeFacto, a counterfactual reasoning framework that explicitly aligns visual evidence with final answers. Our approach integrates three complementary training paradigms: positive, counterfactual, and random-masking. We further develop a language-guided evidence construction pipeline that automatically localizes question-relevant regions and generates counterfactual variants, resulting in DeFacto-100K. Building on this dataset, we train MLLMs with GRPO-based reinforcement learning and design three complementary rewards to promote correct answering, structured reasoning, and consistent evidence selection. Moreover, we introduce DeFacto-1.5K, a human-annotated benchmark for systematically evaluating evidence-grounded consistency beyond answer accuracy. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and evidence-answer consistency over strong baselines.
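As a rough illustration of how three reward signals might feed GRPO-style training, the sketch below combines an answer reward, a format reward, and an evidence-consistency reward into one scalar, then normalizes rewards within a group of sampled responses. The function names, the boolean inputs, and the weights are all hypothetical and are not taken from the paper. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
from statistics import mean, pstdev


def combined_reward(answer_correct: bool,
                    format_ok: bool,
                    evidence_consistent: bool,
                    weights=(1.0, 0.5, 1.0)) -> float:
    """Weighted sum of three binary reward signals.

    The weights are illustrative placeholders, not values from the paper.
    """
    w_ans, w_fmt, w_evi = weights
    return (w_ans * float(answer_correct)
            + w_fmt * float(format_ok)
            + w_evi * float(evidence_consistent))


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so
    responses are scored relative to their siblings for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four responses sampled for the same question.
group = [
    combined_reward(True, True, True),     # correct answer, correct evidence
    combined_reward(True, True, False),    # correct answer, wrong evidence
    combined_reward(False, True, True),    # wrong answer, well-formed output
    combined_reward(False, False, False),  # nothing right
]
advantages = group_relative_advantages(group)
```

Under these assumed weights, a response that answers correctly but selects inconsistent evidence receives a lower advantage than one that gets both right, which is the kind of pressure toward evidence-answer consistency the abstract describes.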
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | -- | -- | 2019 |
| Visual Question Answering | VizWiz | Accuracy | 61.4 | 1820 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 82.9 | 962 |
| Visual Question Answering | ChartQA | Accuracy | 82.1 | 519 |
| Visual Question Answering | ScienceQA | Accuracy | 83.6 | 446 |
| Optical Character Recognition | OCRBench | Score | 871 | 433 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 56.6 | 363 |
| Visual Question Answering | VQA v2 | Accuracy | 72.1 | 333 |
| Visual Question Answering | GQA | Accuracy | 63.9 | 155 |
| Document Visual Question Answering | InfoVQA | Accuracy | 0.791 | 85 |