SRVAU-R1: Enhancing Video Anomaly Understanding via Reflection-Aware Learning
About
Multi-modal large language models (MLLMs) have demonstrated significant progress in reasoning capabilities and shown promising effectiveness in video anomaly understanding (VAU) tasks. However, existing MLLM-based approaches remain largely focused on surface-level descriptions of anomalies and lack deeper reasoning mechanisms over abnormal behaviors, such as explicit self-reflection and self-correction. To address this, we propose Self-Reflection-Enhanced Reasoning for Video Anomaly Understanding (SRVAU-R1), a reflection-aware learning framework that incorporates reflection into MLLM reasoning. Specifically, SRVAU-R1 introduces the first reflection-oriented Chain-of-Thought dataset tailored for VAU, providing structured supervision in three stages: initial reasoning, self-reflection, and revised reasoning. Building on this dataset, it applies a novel reflection-aware learning paradigm that combines supervised fine-tuning and reinforcement fine-tuning to enhance multi-modal reasoning for VAU. Extensive experiments on multiple video anomaly benchmarks demonstrate that SRVAU-R1 consistently outperforms existing methods, achieving significant improvements in both temporal anomaly localization accuracy and reasoning quality.
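To make the three-stage supervision concrete, the following is a minimal sketch of how one reflection-oriented Chain-of-Thought sample might be represented and flattened into a training target. The class name, field names, and the `<think>`, `<reflect>`, and `<revise>` tags are all hypothetical and chosen for illustration; the abstract does not specify the dataset's actual schema or prompt format.

```python
from dataclasses import dataclass


@dataclass
class ReflectionCoTSample:
    """Hypothetical record for one reflection-oriented CoT sample.

    Only the three reasoning stages come from the abstract; every
    field name here is an assumption made for illustration.
    """
    video_id: str
    question: str
    initial_reasoning: str
    self_reflection: str
    revised_reasoning: str

    def to_target_text(self) -> str:
        # Concatenate the stages in the order named in the abstract.
        # The tag names are placeholders, not the authors' format.
        return "\n".join([
            f"<think>{self.initial_reasoning}</think>",
            f"<reflect>{self.self_reflection}</reflect>",
            f"<revise>{self.revised_reasoning}</revise>",
        ])


sample = ReflectionCoTSample(
    video_id="example_0001",
    question="Is there an anomaly in this clip?",
    initial_reasoning="A person runs across the parking lot.",
    self_reflection="Running alone is not abnormal; check what happens next.",
    revised_reasoning="The person breaks a car window, which is abnormal.",
)
target = sample.to_target_text()
```

Keeping the stages as separate fields, rather than one free-form string, would let a training pipeline supervise or reward each stage independently, though whether SRVAU-R1 does so is not stated here.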
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | ECVA | Accuracy | 92.22 | 14 |
| Video Anomaly Question Answering | MSAD | Acc (w/o think) | 89.58 | 8 |
| Video Anomaly Question Answering | UCF-Crime | Accuracy (w/o think) | 92.82 | 8 |
| Video Anomaly Understanding Evaluation | MSAD | CLS Score | 7.65 | 8 |
| Video Anomaly Understanding Evaluation | UCF-Crime | CLS | 7.22 | 8 |
| Video Anomaly Reasoning Evaluation | ECVA | CLS Score | 2.86 | 7 |
| Temporal Anomaly Grounding | MSAD OOD (test) | mIoU | 20.4 | 4 |
| Temporal Anomaly Grounding | ECVA (test) | mIoU | 44.42 | 4 |
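
For readers unfamiliar with the mIoU metric used in the Temporal Anomaly Grounding rows, the sketch below shows the standard temporal intersection-over-union between a predicted and a ground-truth time segment, averaged over samples. It assumes one segment per video given as `(start, end)` in seconds; the function names are illustrative, and the exact evaluation protocol behind the numbers above (scaling, handling of multiple segments or normal videos) is not specified on this page.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds, with start <= end


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    # Two zero-length segments have no defined overlap; treat as 0.
    return intersection / union if union > 0 else 0.0


def mean_iou(preds: List[Segment], gts: List[Segment]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


# One half-overlapping pair and one disjoint pair.
score = mean_iou([(0.0, 10.0), (0.0, 5.0)], [(5.0, 15.0), (5.0, 10.0)])
```

In this example the first pair overlaps for 5 seconds out of a 15-second union and the second pair does not overlap, so the mean is (1/3 + 0) / 2 = 1/6.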