
CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection

About

Video anomaly detection (VAD) -- commonly formulated as a weakly-supervised multiple-instance learning problem because frame-level annotation is labor-intensive -- is a challenging problem in video surveillance in which the anomalous frames must be localized in an untrimmed video. In this paper, we first propose to utilize the ViT-encoded visual features from CLIP, in contrast with the conventional C3D or I3D features in the domain, to efficiently extract discriminative representations. We then model temporal dependencies and nominate the snippets of interest by leveraging our proposed Temporal Self-Attention (TSA). The ablation study confirms the effectiveness of TSA and the ViT features. Extensive experiments show that our proposed CLIP-TSA outperforms the existing state-of-the-art (SOTA) methods by a large margin on three commonly used benchmark datasets in the VAD problem (UCF-Crime, ShanghaiTech Campus, and XD-Violence). Our source code is available at https://github.com/joos2010kj/CLIP-TSA.

Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, Ngan Le • 2022
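
To make the two ideas named in the abstract more concrete, here is a minimal sketch of (1) self-attention across the snippets of a video and (2) multiple-instance-style pooling of snippet scores into one video-level score. It is not the authors' TSA module or their training code: it uses plain scaled dot-product attention with no learned projections, toy 2-dimensional vectors in place of CLIP ViT features, and a hypothetical scorer that reads one feature coordinate. All function names and the choice of `k` are assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(features):
    """Generic scaled dot-product self-attention over T snippet features.

    Assumption: queries, keys and values are the raw features themselves
    (no learned weights), so each output row is a weighted average of all
    snippets, weighted by similarity to the current snippet.
    """
    d = len(features[0])
    out = []
    for q in features:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in features]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, features)) for j in range(d)])
    return out


def video_score(snippet_scores, k):
    """MIL-style pooling: mean of the k highest snippet scores."""
    top = sorted(snippet_scores, reverse=True)[:k]
    return sum(top) / len(top)


# Toy stand-ins for per-snippet features (real CLIP features are much larger).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]
attended = temporal_self_attention(features)

# Hypothetical scorer: treat the second coordinate as the anomaly score.
snippet_scores = [row[1] for row in attended]
print(video_score(snippet_scores, k=2))
```

Under weak supervision only the video-level label is available, so a pooled score like the one returned by `video_score` is what a loss would compare against that label, while the per-snippet scores are what localize the anomaly in time.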

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Anomaly Detection | UCF-Crime | AUC 87.58 | 129 |
| Video Anomaly Detection | UCF-Crime (test) | AUC 87.58 | 122 |
| Video Anomaly Detection | XD-Violence (test) | AP 82.19 | 119 |
| Video Anomaly Detection | XD-Violence | AP 82.19 | 66 |
| Video Anomaly Detection | ShanghaiTech | -- | 51 |
| Video Anomaly Detection | UCF-Crime (frame-level) | AUC 87.58 | 32 |
| Weakly Supervised Video Anomaly Detection | UCFCrime 1.0 (test) | AUC 87.58 | 23 |
| Weakly Supervised Video Anomaly Detection | UCF-Crime | AUC 87.58 | 18 |
| Coarse-grained Video Anomaly Detection | UCF-Crime | AUC 87.58 | 12 |
| Frame-level Video Anomaly Detection | XD-Violence | AP 82.19 | 11 |
Showing 10 of 13 rows
