Mitigating Object Hallucinations via Sentence-Level Early Intervention
About
Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations, that is, fabricated content that contradicts the visual input. Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a framework that eliminates dependency on human annotations. Specifically, we first bootstrap high-quality in-domain preference pairs by iteratively sampling model outputs, validating object existence by cross-checking with two open-vocabulary detectors, and classifying sentences as hallucinated or non-hallucinated. Subsequently, we iteratively build context-aware preference data from context-coherent positive samples and hallucinated negative samples. Finally, we train models using a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level, where hallucinations initially manifest. Experimental results show that SENTINEL can reduce hallucinations by over 90% compared to the original model and outperforms the previous state-of-the-art method on both hallucination benchmarks and general-capability benchmarks, demonstrating its superiority and generalization ability. The models, datasets, and code are available at https://github.com/pspdada/SENTINEL.
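The detector cross-checking step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Detector` interface, the `classify_sentence` function, the three label names, and the decision rule (hallucinated when both detectors miss an object, set aside as uncertain when they disagree) are all assumptions made for illustration. The abstract states only that two open-vocabulary detectors are cross-checked and that sentences are sorted into hallucinated and non-hallucinated categories.

```python
from typing import Callable, Iterable, Literal

Label = Literal["non_hallucinated", "hallucinated", "uncertain"]

# Hypothetical detector interface (an assumption, not a real library API):
# takes an object name and returns True if the detector finds it in the image.
Detector = Callable[[str], bool]


def classify_sentence(
    objects: Iterable[str], detector_a: Detector, detector_b: Detector
) -> Label:
    """Label one sampled sentence from the objects it mentions.

    Assumed rule (illustrative only):
      - any object rejected by BOTH detectors -> "hallucinated"
      - otherwise, any object the detectors disagree on -> "uncertain"
      - every object confirmed by both (or no objects) -> "non_hallucinated"
    """
    saw_disagreement = False
    for obj in objects:
        found_a = detector_a(obj)
        found_b = detector_b(obj)
        if not found_a and not found_b:
            # Both detectors agree the object is absent from the image.
            return "hallucinated"
        if found_a != found_b:
            # Detectors disagree; remember it, but keep scanning in case a
            # later object is rejected by both.
            saw_disagreement = True
    return "uncertain" if saw_disagreement else "non_hallucinated"
```

Under this assumed rule, sentences labelled `"uncertain"` would be left out of the preference data, so that only cases where both detectors agree contribute positive or negative samples.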
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy 61 | 1453 |
| Visual Question Answering | VQA v2 | Accuracy 79.9 | 1429 |
| Text-based Visual Question Answering | TextVQA | Accuracy 82.2 | 962 |
| Multimodal Capability Evaluation | MM-Vet | Score 36.2 | 393 |
| Visual Question Answering | VQA v2 | Accuracy 84 | 333 |
| Hallucination Evaluation | AMBER | CHAIR 2.9 | 222 |
| Hallucination Evaluation | POPE | -- | 217 |
| Multimodal Evaluation | MM-Vet | -- | 196 |
| Hallucination Evaluation | HallusionBench | Accuracy 47.56 | 153 |
| Multimodal Hallucination Evaluation | MMHal-Bench | Average Score 2.48 | 129 |