
VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding

About

Multimodal large language models (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that MLLM visual encoders often fail to distinguish visually different yet semantically similar video pairs, we introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. It consists of 5,002 videos, paired to highlight cases prone to hallucinations. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. Comprehensive testing shows that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency from DINOv2 to reweight visual features during inference. Our results show that DINO-HEAL consistently improves performance on VidHalluc, achieving an average improvement of 3.02% in mitigating hallucinations across all tasks. Both the VidHalluc benchmark and DINO-HEAL code are available at https://people-robots.github.io/vidhalluc.
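The abstract describes DINO-HEAL as a training-free, inference-time reweighting of visual features by spatial saliency from DINOv2. The minimal sketch below illustrates that idea only; it is not the authors' implementation. The saliency definition (softmax over per-patch feature norms) and the `saliency_reweight` helper are assumptions for illustration — the actual method derives saliency from DINOv2 itself.

```python
import numpy as np

def saliency_reweight(frame_features: np.ndarray, dino_features: np.ndarray) -> np.ndarray:
    """Reweight per-patch visual features by a spatial saliency map.

    frame_features: (num_patches, dim) patch features from the MLLM's visual encoder.
    dino_features:  (num_patches, d) patch features from a saliency model
                    (DINOv2 in the paper; here any stand-in array works).
    """
    # Saliency score per patch: L2 norm of the saliency-model features
    # (an assumed proxy; the real method computes saliency from DINOv2).
    sal = np.linalg.norm(dino_features, axis=-1)
    # Softmax over patches for numerical stability, then rescale so the
    # weights average to 1, preserving overall feature magnitude.
    w = np.exp(sal - sal.max())
    w = w / w.sum() * len(w)
    # Emphasize salient patches, de-emphasize the rest.
    return frame_features * w[:, None]
```

Because the reweighting is a fixed, parameter-free transform applied per frame at inference time, it requires no retraining of the MLLM, which matches the "training-free" property claimed for DINO-HEAL.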

Chaoyu Li, Eun Woo Im, Pooyan Fazli • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | ActivityNet-QA (test) | Accuracy | 49.1 | 275 |
| Video Hallucination Evaluation | VideoHallucer | ORH | 36 | 25 |
| Temporal Understanding | TempCompass, TVBench | TempCompass Score | 0.732 | 17 |
| Conventional Video Understanding | Video-MME, MVBench | Video-MME Score | 53.2 | 17 |
| Hallucination Examination | VidHalluc, VideoHallucer, EventHallusion | VidHalluc Score | 73.7 | 17 |
| Hallucination Examination | VidHalluc | BQA | 75.86 | 15 |
| Video Understanding and Reasoning | Video-MME (test) | Overall Accuracy | 59.2 | 15 |
| Hallucination Evaluation | VRIPT-HAL (test) | F1 Score | 49.2 | 15 |
| Hallucination Evaluation | EventHallusion binary QA (test) | Accuracy | 0.626 | 15 |
| Video Understanding and Reasoning | Video-MMMU (test) | Overall Score | 0.463 | 15 |

Showing 10 of 14 rows.
