Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance

About

Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to progressively integrate information across modalities, mitigating redundancy and imbalance. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance.

Shengyang Sun, Jiashen Hua, Junyi Feng, Xiaojin Gong • 2026
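
The bottleneck-token idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' module: it assumes plain scaled dot-product attention without learned weights, a single scale, and hypothetical modality names (`visual`, `audio`, `text`). It only shows the general pattern in which modalities exchange information through a few shared tokens rather than attending to each other directly.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention plus a residual connection.

    Assumption: keys and values are the same vectors and there are no
    learned projections; a real Transformer layer would have both.
    """
    dim = len(keys_values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        mixed = [sum(w * kv[d] for w, kv in zip(weights, keys_values))
                 for d in range(dim)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


def bottleneck_fuse(modalities, bottleneck, num_layers=2):
    """Exchange information between modalities only via bottleneck tokens."""
    fused = dict(modalities)
    for _ in range(num_layers):
        for name in list(fused):
            # Compress: the few bottleneck tokens read from this modality.
            bottleneck = attend(bottleneck, fused[name])
            # Broadcast: the modality reads back from the bottleneck only.
            fused[name] = attend(fused[name], bottleneck)
    return fused, bottleneck


def random_tokens(count, dim):
    return [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


random.seed(0)
DIM = 4
inputs = {
    "visual": random_tokens(5, DIM),  # hypothetical per-snippet features
    "audio": random_tokens(3, DIM),
    "text": random_tokens(2, DIM),
}
fused, bottleneck_out = bottleneck_fuse(inputs, random_tokens(2, DIM))
```

Because every cross-modal path goes through the two bottleneck tokens, each modality receives a compressed summary of the others, which is one plausible reading of how such a design limits redundancy. The token counts and layer count here are arbitrary.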

Related benchmarks

Task                     Dataset                  Result     Rank
Video Anomaly Detection  XD-Violence (test)       AP 85.92   119
Video Anomaly Detection  XD-Violence              AP 85.92   66
Video Anomaly Detection  UCF-Crime (frame-level)  AUC 89.67  32
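
For readers unfamiliar with the two metrics listed here, the following is a minimal sketch of how frame-level AUC and AP can be computed from per-frame anomaly scores and binary labels. The function names and toy data are illustrative assumptions; the benchmarks' official evaluation scripts may differ in details such as tie handling or interpolation, and leaderboards typically report these values as percentages.

```python
def frame_level_auc(scores, labels):
    """ROC AUC as the probability that an anomalous frame outscores a normal one.

    Ties count as half a win. Quadratic in the number of frames, so this
    form is only suitable for small illustrative inputs.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one anomalous frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """AP as the mean of precision values at the rank of each anomalous frame."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    num_pos = sum(labels)
    if num_pos == 0:
        raise ValueError("need at least one anomalous frame")
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_pos


# Toy example: four frames, the last two are anomalous.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auc = frame_level_auc(toy_scores, toy_labels)   # 0.75
toy_ap = average_precision(toy_scores, toy_labels)  # (1/1 + 2/3) / 2
```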
