Open-o3-Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
About
Most video reasoning models generate only textual reasoning traces, without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging because it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3-Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning by highlighting key timestamps, objects, and bounding boxes, making the reasoning process traceable and verifiable. To enable this capability, we first construct STGR, a collection of high-quality datasets that provide unified spatio-temporal supervision, which is absent from existing resources. We further adopt a cold-start reinforcement learning strategy with specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3-Video achieves state-of-the-art performance, improving mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline, and shows consistent gains across a range of video understanding benchmarks. Beyond accuracy, the grounded reasoning traces produced by Open-o3-Video support confidence-aware test-time scaling, improving answer reliability.
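As an illustration of what confidence-aware test-time scaling can look like in general, the sketch below samples several answers and picks the one with the highest summed confidence instead of taking a plain majority vote. This is a generic confidence-weighted voting scheme, not necessarily the procedure used by Open-o3-Video; the function name `confidence_weighted_vote` and the `(answer, confidence)` input format are assumptions made for this example.

```python
from collections import defaultdict


def confidence_weighted_vote(candidates):
    """Return the answer with the largest total confidence.

    candidates: list of (answer, confidence) pairs, one per sampled
    reasoning trace. How the confidence is derived (for example, from
    how well the cited evidence supports the answer) is left open here.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    totals = defaultdict(float)
    for answer, confidence in candidates:
        totals[answer] += confidence
    # Pick the answer whose traces carry the most accumulated confidence.
    return max(totals, key=totals.get)


# Hypothetical samples: a plain majority vote would tie between A and B,
# but the confidence-weighted vote prefers the better-supported answer.
samples = [("B", 0.9), ("A", 0.4), ("A", 0.3), ("B", 0.2)]
print(confidence_weighted_vote(samples))
```

Weighting by confidence lets a few well-grounded traces outweigh several poorly supported ones, which is the intuition behind using grounded evidence at test time.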
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Understanding | MVBench | Accuracy 64.4 | 425 |
| Video Understanding | VideoMME | Score (Long) 54.9 | 248 |
| Video Question Answering | VideoMME | -- | 210 |
| Video Understanding | Video-MME without subtitles | -- | 89 |
| Temporal Grounding | Charades-STA | R@0.5 45.6 | 88 |
| Video-based Question Answering | STAR | Accuracy 70.5 | 50 |
| Spatio-Temporal Reasoning | V-STAR | Chain1 (When) m_tIoU 26.4 | 44 |
| Video Understanding | VideoMMMU | Accuracy 52.3 | 32 |
| Video Understanding | WorldSense | Score 37.5 | 25 |
| Visual Question Answering | CameraBench | Motion Steadiness Accuracy 57.6 | 21 |
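
The temporal grounding results above are built on temporal IoU between a predicted and a ground-truth time span: a recall at threshold 0.5 counts the fraction of queries whose prediction reaches IoU of at least 0.5, and a mean tIoU averages the overlap directly. The sketch below shows these standard definitions under the assumption of one predicted segment per query; the helper names are hypothetical and the benchmarks' official evaluation scripts may differ in details.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at(preds, gts, threshold=0.5):
    """Fraction of queries whose predicted span reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def mean_tiou(preds, gts):
    """Average temporal IoU over all queries."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


# Hypothetical predictions and ground truth for two queries.
preds = [(0.0, 10.0), (2.0, 8.0)]
gts = [(0.0, 10.0), (20.0, 30.0)]
print(recall_at(preds, gts), mean_tiou(preds, gts))
```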