Seeing Right but Saying Wrong: Inter- and Intra-Layer Refinement in MLLMs without Training
About
Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a variety of vision-language tasks. However, their internal reasoning often exhibits a critical inconsistency: although deeper layers may attend to the correct visual regions, final predictions are frequently misled by noisy attention from earlier layers. This creates a disconnect between what the model internally understands and what it ultimately expresses, a phenomenon we describe as *seeing it right but saying it wrong*. To address this issue, we propose DualPD, a dual-perspective decoding refinement strategy that enhances visual understanding without any additional training. DualPD consists of two components. (1) The layer-wise attention-guided contrastive logits module captures how the belief in the correct answer evolves by comparing output logits between the layers that exhibit the largest attention shift. (2) The head-wise information filtering module suppresses low-contribution attention heads that focus on irrelevant regions, thereby improving attention quality within each layer. Experiments on both the LLaVA and Qwen-VL model families across multiple multimodal benchmarks demonstrate that DualPD consistently improves accuracy without training, confirming its effectiveness and generalizability. The code will be released upon publication.
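The two components described above can be illustrated with a minimal numpy sketch. Note that this is our own reading of the abstract, not the released implementation: the function names, the DoLa-style contrast formula `(1 + alpha) * final - alpha * early`, and the head-filtering threshold `tau` are all illustrative assumptions.

```python
import numpy as np

def attention_shift(attn_per_layer):
    """Find the layer index right after the largest attention shift,
    measured as the L1 distance between consecutive layers' image-attention maps.
    (Illustrative proxy for the paper's "largest attention shift" criterion.)"""
    shifts = [np.abs(attn_per_layer[i + 1] - attn_per_layer[i]).sum()
              for i in range(len(attn_per_layer) - 1)]
    return int(np.argmax(shifts)) + 1

def contrastive_logits(logits_per_layer, attn_per_layer, alpha=0.5):
    """Sketch of component (1): contrast the final layer's logits against the
    layer just before the largest attention shift, amplifying beliefs that
    emerge after attention settles on the right regions (assumed formulation)."""
    k = attention_shift(attn_per_layer)
    final = logits_per_layer[-1]
    early = logits_per_layer[k - 1]
    return (1 + alpha) * final - alpha * early

def head_keep_mask(head_attn_to_image, tau=0.1):
    """Sketch of component (2): keep only heads whose attention mass on the
    relevant image region exceeds a threshold; tau is a hypothetical parameter."""
    return np.asarray(head_attn_to_image) >= tau
```

For example, if the attention map jumps onto the correct region only at the last layer, the contrast against the pre-shift layer boosts the tokens whose logits rose after that jump, which is the intuition behind "seeing it right but saying it wrong".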
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 81.29 | 1165 |
| Visual Question Answering | VizWiz | Accuracy | 75.8 | 1043 |
| Visual Question Answering | GQA | Accuracy | 74 | 963 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 85.34 | 496 |
| Visual Question Answering | VQAv2 | Accuracy | 80.78 | 177 |
| Document Visual Question Answering | DocVQA | Accuracy | 86.8 | 81 |
| Knowledge-based Visual Question Answering | OKVQA | Accuracy | 0.6045 | 52 |