
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

About

Although Large Vision-Language Models (LVLMs) have demonstrated remarkable performance on downstream tasks, they frequently produce content that deviates from the visual input, leading to object hallucination. To tackle this, most recent work depends on expensive manual annotation and training, or on decoding strategies that significantly increase inference time. In this work, we observe that LVLMs' attention to visual information is significantly enhanced when answering caption queries compared to non-caption queries. Inspired by this phenomenon, we propose Caption-guided Visual Attention Steering (CAST), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern corresponding to caption queries to enhance LVLMs' visual perception capability. Specifically, we use probing techniques to identify attention heads that are highly sensitive to caption queries and estimate optimized steering directions for their outputs. This steering strengthens LVLMs' fine-grained visual perception capabilities, thereby effectively mitigating object hallucination. CAST reduces object hallucination by an average of 6.03% across five widely used LVLMs and five benchmarks, covering both discriminative and generative tasks, demonstrating state-of-the-art performance while adding little inference cost and preserving other foundational capabilities.
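The probing-and-steering idea described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not specify the probe, the direction estimator, or the steering strength, so the sketch assumes a class-mean-difference linear probe per head, a normalized mean-difference steering direction (a stand-in for the "optimized" directions the abstract mentions), and a hypothetical scalar `alpha`. All function names and array shapes are illustrative assumptions.

```python
import numpy as np


def probe_accuracy(x, y):
    """Score one head with a simple linear probe (assumed, not from the paper).

    x: (n_samples, head_dim) head outputs; y: 1 = caption query, 0 = non-caption.
    Projects onto the class-mean difference and thresholds at the midpoint.
    Accuracy is measured on the same samples, which is enough for a sketch.
    """
    mu1, mu0 = x[y == 1].mean(axis=0), x[y == 0].mean(axis=0)
    d = mu1 - mu0
    threshold = 0.5 * (mu1 + mu0) @ d
    pred = (x @ d > threshold).astype(int)
    return float((pred == y).mean())


def select_heads_and_directions(acts, labels, top_k):
    """Pick the top_k most caption-sensitive heads and a unit direction for each.

    acts: (n_samples, n_heads, head_dim) attention-head outputs.
    Returns {head_index: unit vector pointing from non-caption to caption mean}.
    """
    n_heads = acts.shape[1]
    scores = np.array([probe_accuracy(acts[:, h], labels) for h in range(n_heads)])
    top = np.argsort(scores)[::-1][:top_k]
    directions = {}
    for h in top:
        # Mean-difference direction: an assumed stand-in for the paper's estimator.
        d = acts[labels == 1, h].mean(axis=0) - acts[labels == 0, h].mean(axis=0)
        directions[int(h)] = d / (np.linalg.norm(d) + 1e-8)
    return directions


def steer(head_outputs, directions, alpha):
    """Shift the selected heads' outputs along their directions at inference time.

    head_outputs: (n_heads, head_dim) float array; alpha: hypothetical strength.
    """
    steered = head_outputs.copy()
    for h, d in directions.items():
        steered[h] += alpha * d
    return steered
```

In a real LVLM the activations would be collected from the model's attention layers on paired caption and non-caption queries, and the shift would be applied inside the forward pass; both steps are omitted here because the page does not describe them.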

Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Libo Qin, Ruihan Chen, Lei Huang, Baohang Li, Kui Jiang, Yaowei Wang, Ting Liu, Bing Qin • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | -- | 2019 |
| Hallucination Evaluation | CHAIR | CHAIR_s 34.6 | 393 |
| Object Hallucination | POPE Popular | F1 Score 89.24 | 372 |
| Object Hallucination | POPE Adversarial | Accuracy 85.97 | 353 |
| Object Hallucination | POPE (Random) | F1 Score 90.42 | 324 |
| Hallucination Evaluation | MMHal-Bench | MMHal Score 3.04 | 306 |
| Object Hallucination Evaluation | POPE Adversarial | Accuracy 84.27 | 159 |
| Object Hallucination Evaluation | CHAIR | -- | 154 |
| Hallucination Evaluation | HallusionBench | -- | 153 |
| Multimodal Hallucination Evaluation | MMHal-Bench | Average Score 3.24 | 129 |
Showing 10 of 20 rows
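
For readers unfamiliar with the metrics named in the table, the following is a minimal sketch of how POPE-style accuracy and F1 and a CHAIR_s-style score are commonly computed. It is an illustration under simplifying assumptions, not the evaluation code behind the numbers above: answers are assumed to be already normalized to "yes" or "no", and object mentions are assumed to be already extracted and mapped to a shared vocabulary (the real CHAIR pipeline also handles synonyms). The function names are hypothetical.

```python
def pope_scores(predictions, answers):
    """Accuracy and F1 in percent for yes/no answers, with "yes" as the positive class."""
    pairs = list(zip(predictions, answers))
    tp = sum(p == "yes" and a == "yes" for p, a in pairs)
    fp = sum(p == "yes" and a == "no" for p, a in pairs)
    fn = sum(p == "no" and a == "yes" for p, a in pairs)
    accuracy = 100.0 * sum(p == a for p, a in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 100.0 * 2 * precision * recall / (precision + recall)


def chair_s(caption_objects, image_objects):
    """Percent of captions mentioning at least one object not present in the image.

    caption_objects: per-caption lists of mentioned objects (assumed pre-extracted).
    image_objects: per-image lists of ground-truth objects. Lower is better.
    """
    hallucinated = sum(
        bool(set(mentioned) - set(present))
        for mentioned, present in zip(caption_objects, image_objects)
    )
    return 100.0 * hallucinated / len(caption_objects)
```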
