SRVAU-R1: Enhancing Video Anomaly Understanding via Reflection-Aware Learning

About

Multi-modal large language models (MLLMs) have made significant progress in reasoning and have shown promise on video anomaly understanding (VAU) tasks. However, existing MLLM-based approaches remain largely focused on surface-level descriptions of anomalies and lack deeper reasoning mechanisms over abnormal behaviors, such as explicit self-reflection and self-correction. To address this, we propose Self-Reflection-Enhanced Reasoning for Video Anomaly Understanding (SRVAU-R1), a reflection-aware learning framework that incorporates reflection into MLLM reasoning. Specifically, SRVAU-R1 introduces the first reflection-oriented Chain-of-Thought dataset tailored for VAU, providing structured supervision with initial reasoning, self-reflection, and revised reasoning. Building on this dataset, it includes a novel reflection-aware learning paradigm that combines supervised fine-tuning and reinforcement fine-tuning to enhance multi-modal reasoning for VAU. Extensive experiments on multiple video anomaly benchmarks demonstrate that SRVAU-R1 consistently outperforms existing methods, achieving significant improvements in both temporal anomaly localization accuracy and reasoning quality.
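As a rough illustration of the "initial reasoning, self-reflection, revised reasoning" supervision described above, the sketch below shows one way such a training sample could be represented. The class name, field names, and section labels are hypothetical and chosen only for illustration; the abstract does not specify the actual dataset schema.

```python
from dataclasses import dataclass


@dataclass
class ReflectionCoTSample:
    """Hypothetical container for one reflection-oriented CoT example.

    Only the three reasoning stages come from the abstract; every field
    name here is an assumption, not the authors' schema.
    """
    video_id: str
    question: str
    initial_reasoning: str
    self_reflection: str
    revised_reasoning: str

    def to_target_text(self) -> str:
        # Concatenate the three stages in order, one labeled line each.
        parts = [
            ("Initial reasoning", self.initial_reasoning),
            ("Self-reflection", self.self_reflection),
            ("Revised reasoning", self.revised_reasoning),
        ]
        return "\n".join(f"{label}: {text}" for label, text in parts)


sample = ReflectionCoTSample(
    video_id="example_video",
    question="Is there an anomaly in this clip?",
    initial_reasoning="A person runs across the parking lot.",
    self_reflection="Running alone is not abnormal; check for other cues.",
    revised_reasoning="The person is chased after breaking a car window.",
)
print(sample.to_target_text())
```

Keeping the three stages as separate fields would let a supervised fine-tuning target be assembled in a fixed order, while still allowing each stage to be inspected or scored on its own.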

Zihao Zhao, Shengting Cao, Muchao Ye • 2026

Related benchmarks

Task	Dataset	Result
Video Question Answering	ECVA	Accuracy 92.22	14
Video Anomaly Question Answering	MSAD	Acc (w/o think) 89.58	8
Video Anomaly Question Answering	UCF-Crime	Accuracy (w/o think) 92.82	8
Video Anomaly Understanding Evaluation	MSAD	CLS Score 7.65	8
Video Anomaly Understanding Evaluation	UCF-Crime	CLS 7.22	8
Video Anomaly Reasoning Evaluation	ECVA	CLS Score 2.86	7
Temporal Anomaly Grounding	MSAD OOD (test)	mIoU 20.4	4
Temporal Anomaly Grounding	ECVA (test)	mIoU 44.42	4
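
The temporal anomaly grounding rows report mIoU. As a minimal sketch of how a mean temporal IoU is commonly computed, the code below compares one predicted segment against one ground-truth segment per video and averages the result. The function names and the one-segment-per-video assumption are illustrative; the benchmarks' exact evaluation protocol is not given here.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds, with start <= end


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Segment], gts: List[Segment]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


# One perfect match and one complete miss average to 0.5.
print(mean_iou([(0.0, 10.0), (0.0, 10.0)], [(0.0, 10.0), (20.0, 30.0)]))
```

The function returns a fraction in [0, 1]; multiplying by 100 gives a percentage-style score.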

