ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models

About

Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. Existing work has explored contrastive decoding approaches to mitigate this issue, where the output of the original LVLM is compared and contrasted with that of a perturbed version. However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. Extensive experimental results demonstrate that our proposed ONLY consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost. Code is available at https://github.com/zifuwan/ONLY.
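The two ideas the abstract contrasts can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `contrastive_logits` uses the generic contrastive decoding form (original logits pushed away from a perturbed query's logits), and `text_to_visual_entropy_ratio` is a hypothetical helper that assumes the ratio is taken between the Shannon entropies of one token's attention over textual and visual positions. The abstract does not say how the ratio is computed, so the attention split, the function names, and the `alpha` and `eps` parameters are all assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not all be zero")
    return [w / total for w in weights]


def contrastive_logits(original, perturbed, alpha=1.0):
    """Generic contrastive decoding: needs two queries (original and perturbed).

    Returns (1 + alpha) * original - alpha * perturbed, element-wise.
    """
    return [(1 + alpha) * o - alpha * p for o, p in zip(original, perturbed)]


def text_to_visual_entropy_ratio(attn, num_visual, eps=1e-12):
    """Hypothetical ratio for one token, computed from a single query.

    ASSUMPTION: `attn` holds the token's attention weights, with the first
    `num_visual` entries belonging to visual tokens and the rest to text.
    """
    visual = normalize(attn[:num_visual])
    text = normalize(attn[num_visual:])
    return entropy(text) / (entropy(visual) + eps)


# Two forward passes are needed for the contrastive variant ...
adjusted = contrastive_logits([2.0, 1.0], [1.0, 1.0], alpha=1.0)

# ... while the entropy ratio is read off a single pass.
ratio = text_to_visual_entropy_ratio([0.25, 0.25, 0.25, 0.25], num_visual=2)
```

The point of the comparison is cost: the contrastive form requires logits from a second, perturbed query, whereas a per-token ratio like the one sketched here can be computed from quantities already available in one forward pass.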

Zifu Wan, Ce Zhang, Silong Yong, Martin Q. Ma, Simon Stepputtis, Louis-Philippe Morency, Deva Ramanan, Katia Sycara, Yaqi Xie • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | Accuracy 85.1 | 2019 |
| Multimodal Understanding | MMBench | Accuracy 64.87 | 847 |
| Science Question Answering | ScienceQA | Accuracy 67.23 | 791 |
| Multimodal Evaluation | MME | -- | 727 |
| Multimodal Understanding | MM-Vet | MM-Vet Score 32.5 | 631 |
| Multimodal Understanding | MMStar | Accuracy 32.27 | 407 |
| Hallucination Evaluation | CHAIR | CHAIR_s 52.2 | 393 |
| Multimodal Capability Evaluation | MM-Vet | Score 46.97 | 393 |
| Object Hallucination | POPE Popular | F1 Score 87.79 | 372 |
| Object Hallucination | POPE Adversarial | Accuracy 86.93 | 353 |
Showing 10 of 38 rows
