CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs

About

Large Vision-Language Model (LVLM) systems have demonstrated impressive vision-language reasoning capabilities but suffer from pervasive and severe hallucination issues, posing significant risks in critical domains such as healthcare and autonomous systems. Despite previous efforts to mitigate hallucinations, a persistent issue remains: visual defects arising from vision-language misalignment, which create a bottleneck in visual processing capacity. To address this challenge, we develop Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs (CATCH), based on Information Bottleneck theory. CATCH introduces Complementary Visual Decoupling (CVD) for visual information separation, Non-Visual Screening (NVS) for hallucination detection, and Adaptive Token-level Contrastive Decoding (ATCD) for hallucination mitigation. Together, these components target the visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios. CATCH is applicable to various visual question-answering tasks without requiring any task-specific data or prior knowledge, and it generalizes robustly to new tasks without additional training, opening new possibilities for advancing LVLMs in challenging applications.
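
To make the idea of token-level contrastive decoding concrete, here is a minimal sketch of one decoding step in the generic contrastive-decoding style. It is not the CATCH implementation: the abstract does not give the formulas for CVD, NVS, or ATCD, so the function name `contrastive_step`, the fixed contrast strength `alpha`, the plausibility cutoff `beta`, and the idea of a single "contrast" branch are all assumptions made for illustration. How CATCH adapts the contrast per token is not described in this summary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_step(logits_full, logits_contrast, alpha=1.0, beta=0.1):
    """One next-token step of generic contrastive decoding (illustrative only).

    logits_full:     logits conditioned on the original visual input.
    logits_contrast: logits from a hypothetical contrasting branch,
                     e.g. one with degraded or non-visual input.
    alpha:           assumed fixed contrast strength.
    beta:            assumed plausibility cutoff relative to the top token.
    """
    probs_full = softmax(logits_full)
    cutoff = beta * max(probs_full)
    scores = []
    for lf, lc, pf in zip(logits_full, logits_contrast, probs_full):
        if pf >= cutoff:
            # Amplify what the full branch prefers over the contrast branch.
            scores.append((1 + alpha) * lf - alpha * lc)
        else:
            # Tokens the full branch finds implausible are masked out.
            scores.append(float("-inf"))
    return softmax(scores)


# Toy vocabulary of three tokens.
full = [2.0, 1.9, -5.0]
contrast = [0.0, 1.9, 3.0]
adjusted = contrastive_step(full, contrast)
```

In this toy case the second token is almost as likely as the first under the full branch, but the contrast branch likes it just as much, so the contrast pushes probability toward the first token. The third token is masked because the full branch considers it implausible, even though the contrast branch rates it highly.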

Zhehan Kan, Ce Zhang, Zihan Liao, Yapeng Tian, Wenming Yang, Junyuan Xiao, Xu Li, Dongmei Jiang, Yaowei Wang, Qingmin Liao • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | MSCOCO POPE | Random Accuracy: 90.43 | 71 |
| Discriminative Object Hallucination | POPE MSCOCO Adversarial | Accuracy: 83.17 | 43 |
| Discriminative Object Hallucination | POPE MSCOCO (Random) | Accuracy: 90.43 | 29 |
| Discriminative Object Hallucination | POPE MSCOCO Popular | Accuracy: 87.07 | 29 |
| Image Captioning | MSCOCO | -- | 26 |
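
For readers unfamiliar with the metric, POPE asks yes/no questions about whether an object is present in an image, and accuracy is the share of answers that match the ground truth. The sketch below shows that calculation under simple assumptions: the function name `yes_no_accuracy` and the plain string inputs are hypothetical, and real evaluation scripts may parse model output differently.

```python
def yes_no_accuracy(predictions, labels):
    """Percentage of yes/no answers that match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(
        p.strip().lower() == t.strip().lower()
        for p, t in zip(predictions, labels)
    )
    return 100.0 * correct / len(labels)


# Four toy answers, three of which match the labels.
score = yes_no_accuracy(["Yes", "no", "yes", "no"], ["yes", "no", "no", "no"])
```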
