CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection

About

Video anomaly detection (VAD) -- commonly formulated as a weakly-supervised multiple-instance learning problem because frame-level annotation is labor-intensive -- is a challenging problem in video surveillance in which the anomalous frames must be localized in an untrimmed video. In this paper, we first propose to use the ViT-encoded visual features from CLIP, in contrast with the C3D or I3D features conventionally used in the domain, to efficiently extract discriminative representations. We then model temporal dependencies and nominate the snippets of interest with our proposed Temporal Self-Attention (TSA). The ablation study confirms the effectiveness of TSA and the ViT features. Extensive experiments show that our proposed CLIP-TSA outperforms the existing state-of-the-art (SOTA) methods by a large margin on three commonly used VAD benchmark datasets (UCF-Crime, ShanghaiTech Campus, and XD-Violence). Our source code is available at https://github.com/joos2010kj/CLIP-TSA.
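
The multiple-instance learning formulation mentioned above can be illustrated with a minimal sketch. This is not the CLIP-TSA implementation: the function names, the top-k mean pooling rule, the threshold, and the scores are all assumptions chosen for illustration. The idea shown is that a video is treated as a bag of snippets, only a video-level label is available for training, and a video-level score is derived from per-snippet anomaly scores, which can then also be used to localize anomalous snippets.

```python
from typing import List


def video_score(snippet_scores: List[float], k: int = 3) -> float:
    """Aggregate per-snippet anomaly scores into one video-level score.

    Assumption for illustration: the bag (video) score is the mean of its
    top-k snippet scores, a common MIL pooling choice. It is not taken
    from the CLIP-TSA paper.
    """
    if not snippet_scores:
        raise ValueError("a video needs at least one snippet score")
    k = max(1, min(k, len(snippet_scores)))  # clamp k to the bag size
    top_k = sorted(snippet_scores, reverse=True)[:k]
    return sum(top_k) / k


def localize(snippet_scores: List[float], threshold: float = 0.5) -> List[int]:
    """Return the indices of snippets whose score exceeds the threshold."""
    return [i for i, s in enumerate(snippet_scores) if s > threshold]


# Hypothetical snippet scores for one untrimmed video (made-up numbers).
scores = [0.05, 0.10, 0.90, 0.80, 0.10, 0.70]
bag_score = video_score(scores, k=3)   # video-level anomaly score
anomalous_snippets = localize(scores)  # snippet indices flagged as anomalous
```

In a weakly-supervised setup, only `bag_score` would be compared against the video-level label during training; the per-snippet scores are what make localization possible at test time.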

Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, Ngan Le • 2022

Related benchmarks

Task	Dataset	Result
Video Anomaly Detection	UCF-Crime	AUC 87.58	263
Video Anomaly Detection	UCF-Crime (test)	AUC 87.58	164
Video Anomaly Detection	XD-Violence (test)	AP 82.19	164
Video Anomaly Detection	XD-Violence	AP 82.19	123
Video Anomaly Detection	ShanghaiTech	--	51
Video Anomaly Detection	XD-Violence	AP 82.19	36
Video Anomaly Detection	UCF-Crime (frame-level)	AUC 87.58	32
Weakly Supervised Video Anomaly Detection	UCFCrime 1.0 (test)	AUC 87.58	23
Weakly Supervised Video Anomaly Detection	UCF-Crime	AUC 87.58	18
Coarse-grained Video Anomaly Detection	UCF-Crime	AUC 87.58	12
