Open-o3-Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

About

Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging because it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3-Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning by highlighting key timestamps, objects, and bounding boxes, making the reasoning process traceable and verifiable. To enable this capability, we first construct STGR, a set of high-quality datasets that provide unified spatio-temporal supervision, which is absent in existing resources. We further adopt a cold-start reinforcement learning strategy with specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3-Video achieves state-of-the-art performance, improving mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline, and shows consistent gains across a range of video understanding benchmarks. Beyond accuracy, the grounded reasoning traces produced by Open-o3-Video support confidence-aware test-time scaling, improving answer reliability.
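
To make the idea of a reward that jointly scores answer accuracy, temporal alignment, and spatial precision more concrete, here is a minimal sketch. It is not the reward used by Open-o3-Video: the abstract gives no formula, so the weighted sum, the default weights, and every function name below are assumptions chosen for illustration. The two overlap measures are the standard temporal IoU (between time intervals) and box IoU (between bounding boxes).

```python
from typing import Tuple

Interval = Tuple[float, float]            # (start_sec, end_sec), start <= end
Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), x1 <= x2, y1 <= y2


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred: Box, gt: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def combined_reward(
    answer_correct: bool,
    pred_span: Interval,
    gt_span: Interval,
    pred_box: Box,
    gt_box: Box,
    weights: Tuple[float, float, float] = (1.0, 0.5, 0.5),  # hypothetical weights
) -> float:
    """Hypothetical weighted sum of answer, temporal, and spatial terms."""
    w_answer, w_time, w_space = weights
    return (
        w_answer * float(answer_correct)
        + w_time * temporal_iou(pred_span, gt_span)
        + w_space * box_iou(pred_box, gt_box)
    )
```

For example, `combined_reward(True, (2.0, 6.0), (3.0, 7.0), (0, 0, 2, 2), (1, 1, 3, 3))` rewards a correct answer fully while giving only partial credit for a time span and a box that overlap the ground truth imperfectly. A simple sum is the easiest way to show the three signals side by side; the actual training setup may combine or gate them differently.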

Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Understanding | MVBench | Accuracy: 64.4 | 425 |
| Video Understanding | VideoMME | Score (Long): 54.9 | 248 |
| Video Question Answering | VideoMME | -- | 210 |
| Video Understanding | Video-MME without subtitles | -- | 89 |
| Temporal Grounding | Charades-STA | R@0.5: 45.6 | 88 |
| Video-based Question Answering | STAR | Accuracy: 70.5 | 50 |
| Spatio-Temporal Reasoning | V-Star | Chain1 (When) m tIoU: 26.4 | 44 |
| Video Understanding | VideoMMMU | Accuracy: 52.3 | 32 |
| Video Understanding | WorldSense | Score: 37.5 | 25 |
| Visual Question Answering | CameraBench | Motion Steadiness Accuracy: 57.6 | 21 |
Showing 10 of 18 rows
