
ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models

About

Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. Existing work has explored contrastive decoding approaches to mitigate this issue, where the output of the original LVLM is compared and contrasted with that of a perturbed version. However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. Extensive experimental results demonstrate that our proposed ONLY consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost. Code is available at https://github.com/zifuwan/ONLY.
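As a rough illustration of the contrastive decoding baseline the abstract describes (not of ONLY itself), the sketch below combines next-token logits from two queries: one with the original image and one with a perturbed image. The weighting `(1 + alpha) * original - alpha * perturbed` is one common formulation and is an assumption here, as are the toy vocabulary, the logit values, and the function names; none of them come from the paper or its repository.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(original, perturbed, alpha=1.0):
    """Assumed formulation: (1 + alpha) * original - alpha * perturbed."""
    return [(1 + alpha) * o - alpha * p for o, p in zip(original, perturbed)]


# Hypothetical next-token logits over a toy 3-word vocabulary.
vocab = ["dog", "cat", "frisbee"]
original = [1.8, 1.0, 2.0]   # query 1: model sees the real image
perturbed = [1.0, 1.0, 2.5]  # query 2: model sees a distorted image

# Tokens that the perturbed query favors (here "frisbee") are pushed down.
adjusted = contrastive_logits(original, perturbed, alpha=1.0)
probs = softmax(adjusted)

greedy_before = vocab[original.index(max(original))]
greedy_after = vocab[probs.index(max(probs))]
print(greedy_before, "->", greedy_after)
```

Each decoding step in this scheme needs both `original` and `perturbed`, which is the two-query cost the abstract identifies as the motivation for a single-query approach.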

Zifu Wan, Ce Zhang, Silong Yong, Martin Q. Ma, Simon Stepputtis, Louis-Philippe Morency, Deva Ramanan, Katia Sycara, Yaqi Xie • 2025

Related benchmarks

Task                               Dataset            Result               Rank
Object Hallucination Evaluation    POPE               Accuracy 85.1        935
Multimodal Evaluation              MME                --                   557
Multimodal Understanding           MM-Vet             MM-Vet Score 32.5    418
Multimodal Understanding           MMBench            Accuracy 64.87       367
Multimodal Capability Evaluation   MM-Vet             Score 46.97          282
Science Question Answering         ScienceQA          Accuracy 67.23       229
Object Hallucination               POPE (Random)      F1 Score 89.09       200
Multimodal Understanding           MMStar             Accuracy 32.27       197
Object Hallucination               POPE Adversarial   Accuracy 86.93       196
Object Hallucination               POPE Popular       F1 Score 87.79       188
Showing 10 of 28 rows
