Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity

About

How can we enable models to comprehend video anomalies occurring over varying temporal scales and contexts? Traditional Video Anomaly Understanding (VAU) methods focus on frame-level anomaly prediction, often missing the interpretability of complex and diverse real-world anomalies. Recent multimodal approaches leverage visual and textual data but lack hierarchical annotations that capture both short-term and long-term anomalies. To address this challenge, we introduce HIVAU-70k, a large-scale benchmark for hierarchical video anomaly understanding across any granularity. We develop a semi-automated annotation engine that efficiently scales high-quality annotations by combining manual video segmentation with recursive free-text annotation using large language models (LLMs). This results in over 70,000 multi-granular annotations organized at clip-level, event-level, and video-level segments. For efficient anomaly detection in long videos, we propose the Anomaly-focused Temporal Sampler (ATS). ATS integrates an anomaly scorer with a density-aware sampler to adaptively select frames based on anomaly scores, ensuring that the multimodal LLM concentrates on anomaly-rich regions, which significantly enhances both efficiency and accuracy. Extensive experiments demonstrate that our hierarchical instruction data markedly improves anomaly comprehension. The integrated ATS and visual-language model outperform traditional methods in processing long videos. Our benchmark and model are publicly available at https://github.com/pipixin321/HolmesVAU.

Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Xiaonan Huang, Changxin Gao, Shanjun Zhang, Li Yu, Nong Sang • 2024
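
The abstract describes ATS only at a high level: an anomaly scorer feeds a density-aware sampler that picks more frames where scores are high. One way to read that, offered here as a minimal sketch and not as the authors' implementation, is to treat the per-frame scores as a sampling density and pick frames at evenly spaced quantiles of its cumulative sum. The function name `density_aware_sample` and the floor parameter `tau` are hypothetical, and the scores are assumed to come from some separate anomaly scorer.

```python
from bisect import bisect_left
from itertools import accumulate


def density_aware_sample(scores, num_frames, tau=0.1):
    """Pick up to `num_frames` frame indices, denser where scores are high.

    scores: per-frame anomaly scores, assumed non-negative.
    tau: hypothetical floor added to every score so that low-scoring
         (normal) regions still receive some samples.
    """
    if not scores or num_frames <= 0:
        return []
    # Treat (score + tau) as an unnormalised sampling density over time.
    density = [s + tau for s in scores]
    cumulative = list(accumulate(density))
    total = cumulative[-1]
    picks = []
    for k in range(num_frames):
        # Evenly spaced targets on the cumulative-density axis...
        target = (k + 0.5) / num_frames * total
        # ...mapped back to the first frame whose cumulative mass reaches them.
        picks.append(bisect_left(cumulative, target))
    # Deduplicate while keeping temporal order.
    return sorted(set(picks))


# Toy input: 30 frames with a high-scoring burst in frames 10..19.
scores = [0.0] * 10 + [1.0] * 10 + [0.0] * 10
selected = density_aware_sample(scores, num_frames=8)
```

With the toy input above, most of the selected indices fall inside the high-scoring burst, while constant scores reduce to evenly spaced sampling. A larger `tau` moves the behaviour toward uniform sampling; a smaller one concentrates the frame budget on anomaly-rich regions.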

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Video Anomaly Detection | UCF-Crime | AUC | 88.96 | 129 |
| Video Anomaly Detection | UCF-Crime (test) | AUC | 88.96 | 122 |
| Video Anomaly Detection | XD-Violence (test) | AP | 87.68 | 119 |
| Video Anomaly Detection | UCF-Crime (frame-level) | AUC | 87.68 | 32 |
| Open-ended Question Answering | Vad-Reasoning-Plus | BLEU-3 | 0.014 | 27 |
| Multi-choice Question Answering | Vad-Reasoning-Plus | MCQ Score | 65.5 | 27 |
| Video Anomaly Reasoning | Video Anomaly Reasoning (test) | RR Score | 0.007 | 27 |
| Video Anomaly Reasoning | Vad-Reasoning-Plus (test) | BLEU | 0.00e+0 | 27 |
| Video Anomaly Detection | XD-Violence | AP | 87.68 | 14 |
| Frame-level Video Anomaly Detection | XD-Violence | AP | 88.96 | 11 |
Showing 10 of 21 rows
