Open-o3-Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
About
Most video reasoning models generate only textual reasoning traces, without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging because it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3-Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning by highlighting key timestamps, objects, and bounding boxes, making the reasoning process traceable and verifiable. To enable this capability, we first construct the high-quality STGR datasets, which provide unified spatio-temporal supervision that is absent from existing resources. We further adopt a cold-start reinforcement learning strategy with specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3-Video achieves state-of-the-art performance, improving mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline, and shows consistent gains across a range of video understanding benchmarks. Beyond accuracy, the grounded reasoning traces produced by Open-o3-Video support confidence-aware test-time scaling, improving answer reliability.
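The abstract refers to temporal alignment and spatial precision without defining how they are measured. A common way to quantify both is intersection over union (IoU), computed over time intervals and over bounding boxes respectively. The sketch below is illustrative only: the function names, the `(start, end)` interval format in seconds, and the `(x1, y1, x2, y2)` box format are assumptions made for this example, and it is not the reward formulation used by Open-o3-Video.

```python
from typing import Tuple

# Assumed formats for illustration (not taken from the paper):
Interval = Tuple[float, float]            # (start_s, end_s), in seconds
Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), corner format


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """IoU of two time intervals; 0.0 when they do not overlap."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred: Box, gt: Box) -> float:
    """IoU of two axis-aligned boxes; 0.0 when they do not overlap."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # A predicted 0-10 s span against a ground-truth 5-15 s span.
    print(round(temporal_iou((0.0, 10.0), (5.0, 15.0)), 4))
    # Two 2x2 boxes offset by one unit in each direction.
    print(round(box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)), 4))
```

Averaging `temporal_iou` over a set of examples gives a mean temporal IoU, the kind of quantity reported by metrics such as mIoU and tIoU in grounding benchmarks.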
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Understanding | MVBench | Accuracy 64.4 | 563 |
| Video Understanding | VideoMME | Score (Overall) 63.6 | 357 |
| Video Question Answering | VideoMME | -- | 251 |
| Video Understanding | Video-MME without subtitles | -- | 108 |
| Temporal Grounding | Charades-STA | mIoU 42.5 | 107 |
| Long Video Understanding | VideoMME | Accuracy 63.6 | 89 |
| Video Understanding | VideoMMMU | Accuracy 52.3 | 59 |
| Video-based Question Answering | STAR | Accuracy 70.5 | 50 |
| Spatio-Temporal Reasoning | V-STAR | Chain1 (When) m tIoU 26.4 | 44 |
| Video Understanding | WorldSense | Score 37.5 | 25 |