Open-o3-Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

About

Most video reasoning models generate only textual reasoning traces, without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging because it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3-Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning by highlighting key timestamps, objects, and bounding boxes, making the reasoning process traceable and verifiable. To enable this capability, we first construct STGR, a collection of high-quality datasets that provides unified spatio-temporal supervision, which is absent from existing resources. We further adopt a cold-start reinforcement learning strategy with specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3-Video achieves state-of-the-art performance, improving mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline, and shows consistent gains across a range of video understanding benchmarks. Beyond accuracy, the grounded reasoning traces produced by Open-o3-Video support confidence-aware test-time scaling, improving answer reliability.
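
The abstract names three reward components (answer accuracy, temporal alignment, spatial precision) but does not give their formulas. As a rough illustration only, the sketch below combines an exact-match answer term, a temporal term that decays with timestamp error, and a bounding-box IoU term. The function names, the Gaussian-shaped temporal term, the `sigma` scale, the dictionary layout, and the equal weights are all assumptions made for this example; they are not the paper's reward design.

```python
import math


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def temporal_score(pred_t, gt_t, sigma=2.0):
    """Hypothetical temporal term: 1.0 at zero error, decaying with the
    squared timestamp error in seconds (sigma is an assumed scale)."""
    return math.exp(-((pred_t - gt_t) ** 2) / (2 * sigma ** 2))


def grounded_reward(pred, gt, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of answer, temporal, and spatial terms.
    The weights are illustrative, not taken from the paper."""
    w_ans, w_t, w_s = weights
    r_ans = 1.0 if pred["answer"] == gt["answer"] else 0.0
    r_t = temporal_score(pred["time"], gt["time"])
    r_s = box_iou(pred["box"], gt["box"])
    return w_ans * r_ans + w_t * r_t + w_s * r_s


# Usage: one predicted piece of evidence scored against its annotation.
gt = {"answer": "B", "time": 12.0, "box": (10, 10, 50, 50)}
pred = {"answer": "B", "time": 13.0, "box": (12, 12, 52, 52)}
reward = grounded_reward(pred, gt)
```

A weighted sum keeps the three signals separable, so each weight can be tuned or ablated on its own; whether the paper combines its rewards this way is not stated in the abstract.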

Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang • 2025

Related benchmarks

Task	Dataset	Result
Video Understanding	MVBench	Accuracy 64.4	635
Video Understanding	VideoMME	Score (Overall) 63.6	369
Video Question Answering	VideoMME	--	254
Video Understanding	Video-MME without subtitles	--	145
Temporal Grounding	Charades-STA	mIoU 42.5	120
Temporal Grounding	ActivityNet	Recall@0.3 49.5	111
Long Video Understanding	VideoMME	Accuracy 63.6	97
Video Understanding	VideoMMMU	Accuracy 52.3	67
Video Reasoning	LongVideoReason	Accuracy 69.4	61
Video-based Question Answering	STAR	Accuracy 70.5	50
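
For readers unfamiliar with the temporal grounding metrics in the table, the sketch below shows the usual definitions of temporal IoU, mean IoU (mIoU), and Recall at an IoU threshold such as 0.3. It assumes one predicted interval per query and a `>=` comparison at the threshold; the official Charades-STA and ActivityNet evaluation scripts may differ in such details, and the function names and sample intervals here are made up for illustration.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average temporal IoU over all queries."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)


def recall_at(preds, gts, threshold=0.3):
    """Fraction of queries whose predicted interval reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


# Usage with made-up intervals: one exact match, one partial overlap, one miss.
preds = [(0.0, 10.0), (5.0, 8.0), (20.0, 30.0)]
gts = [(0.0, 10.0), (6.0, 12.0), (40.0, 50.0)]
miou = mean_iou(preds, gts)
r_at_03 = recall_at(preds, gts, threshold=0.3)
```

Benchmarks typically report both values as percentages, so a returned 0.425 would correspond to an mIoU of 42.5.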

