Harnessing Large Language Models for Training-free Video Anomaly Detection

About

Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality, using video-level supervision, one-class supervision, or an unsupervised setting. Training-based methods tend to be domain-specific and are therefore costly to deploy in practice, since any domain change requires new data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method that tackles VAD in a novel, training-free paradigm by exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate textual descriptions for each frame of any test video. Given these textual scene descriptions, we then devise a prompting mechanism that unlocks the capability of LLMs for temporal aggregation and anomaly score estimation, turning LLMs into effective video anomaly detectors. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.
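The pipeline described above (caption each frame, clean the captions with cross-modal similarity, score them with an LLM, then refine the scores) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the nearest-caption cleaning rule, the softmax-weighted neighbour averaging, and the parameters `k` and `temperature` are all assumptions made for illustration, and the captioner, the LLM prompt, and the temporal aggregation step are left out or replaced by a hypothetical `score_caption` callable. Embeddings are assumed to come from some modality-aligned VLM and are passed in as plain lists of floats.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def clean_captions(frame_embs: List[Vector], caption_embs: List[Vector],
                   captions: List[str]) -> List[str]:
    """Assumed cleaning rule: give each frame the caption, taken from the
    whole video, whose text embedding is closest to the frame embedding."""
    cleaned = []
    for frame in frame_embs:
        best = max(range(len(captions)),
                   key=lambda j: cosine(frame, caption_embs[j]))
        cleaned.append(captions[best])
    return cleaned


def refine_scores(frame_embs: List[Vector], scores: List[float],
                  k: int = 2, temperature: float = 0.1) -> List[float]:
    """Assumed refinement rule: replace each frame's score with a
    softmax-weighted average over its k most similar other frames."""
    refined = []
    for i, frame in enumerate(frame_embs):
        sims = [(cosine(frame, other), j)
                for j, other in enumerate(frame_embs) if j != i]
        top = sorted(sims, reverse=True)[:k]
        if not top:  # single-frame video: nothing to aggregate over
            refined.append(scores[i])
            continue
        weights = [math.exp(sim / temperature) for sim, _ in top]
        total = sum(weights)
        refined.append(
            sum(w * scores[j] for w, (_, j) in zip(weights, top)) / total)
    return refined


def detect(frame_embs: List[Vector], caption_embs: List[Vector],
           captions: List[str], score_caption: Callable[[str], float],
           k: int = 2) -> List[float]:
    """End-to-end sketch; score_caption is a hypothetical stand-in for
    prompting an LLM to return an anomaly score in [0, 1]."""
    cleaned = clean_captions(frame_embs, caption_embs, captions)
    raw = [score_caption(caption) for caption in cleaned]
    return refine_scores(frame_embs, raw, k=k)
```

In this sketch every stage is training-free: swapping in a real captioner, a real LLM prompt, or a different similarity rule only changes the callables and embeddings passed in, not any learned weights.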

Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, Elisa Ricci • 2024

Related benchmarks

Task                           Dataset                     Result            Rank
Video Anomaly Detection        UCF-Crime                   AUC 80.28         218
Video Anomaly Detection        XD-Violence (test)          AP 62.01          146
Video Anomaly Detection        UCF-Crime (test)            AUC 80.28         122
Anomaly Detection              UCF-Crime (test)            AUC 0.8028        109
Video Anomaly Detection        XD-Violence                 AP 62.01          93
Video Anomaly Detection        UBnormal (test)             AUC 64.23         44
Video Anomaly Detection        UCF-Crime (frame-level)     AUC 80.82         32
Temporal Anomaly Localization  XD-Violence (test)          AP (%) 62.01      18
Video Anomaly Detection        UCF-Crime 6 (clip-level)    Accuracy 77.24    16
Video Anomaly Detection        XD-Violence                 AUC 85.36         15
Showing 10 of 18 rows
