Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models

About

Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they frequently struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that may be noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model's attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt-level markers that encourage the model to re-attend to the relevant evidence during generation. Experiments across multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Additional evaluations on vision-centric and hallucination-oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.
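
The highlighting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how attention is aggregated or what the markers look like, so the per-passage and per-patch relevance scores are assumed to be given, and the names `top_k_indices`, `highlight_passages`, `top_patches`, and the `<evidence>` marker strings are hypothetical.

```python
from typing import List, Sequence, Tuple


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest scores, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def highlight_passages(
    passages: Sequence[str],
    scores: Sequence[float],
    k: int,
    open_marker: str = "<evidence>",    # hypothetical marker, not from the paper
    close_marker: str = "</evidence>",  # hypothetical marker, not from the paper
) -> str:
    """Wrap the k highest-scoring retrieved passages in prompt-level markers.

    `scores` is assumed to hold one attention-derived relevance value per
    passage; how that value is computed is outside this sketch.
    """
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    keep = set(top_k_indices(scores, k))
    lines = [
        f"{open_marker}{text}{close_marker}" if i in keep else text
        for i, text in enumerate(passages)
    ]
    return "\n".join(lines)


def top_patches(
    patch_scores: Sequence[float], grid_width: int, k: int
) -> List[Tuple[int, int]]:
    """Map the k highest-scoring image patches (row-major order) to (row, col)."""
    return [divmod(i, grid_width) for i in top_k_indices(patch_scores, k)]
```

A second generation pass would then be run on the prompt returned by `highlight_passages`, optionally together with a reference to the regions returned by `top_patches`. How the visual regions are actually marked in the prompt is left open here because the abstract does not say.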

Marco Morini, Sara Sarto, Marcella Cornia, Lorenzo Baraldi • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Object Hallucination Evaluation	POPE	Accuracy	90.2	2019
Real-world Visual Question Answering	RealworldQA	Accuracy	74.9	173
Chart Understanding and Reasoning	ChartQA	Accuracy	89	87
OCR Performance Evaluation	OCRBench	Score	86.2	68
OCR Visual Question Answering	TextVQA	Accuracy	80.5	57
Knowledge-Intensive Visual Question Answering	InfoSeek (val)	Accuracy (All)	36.8	50
Knowledge-Intensive Visual Question Answering	E-VQA (test)	Accuracy (All)	36.4	34
Knowledge-based Visual Question Answering	ViQuAE (test)	Overall Accuracy	61	20
Knowledge-based Visual Question Answering	KB-VQA Aggregate	Average Score	37.5	20
Vision-Centric Question Answering	V-Star	Accuracy	83.6	20

Showing 10 of 12 rows
