Motion-o: Trajectory-Grounded Video Reasoning
About
Recent video reasoning models increasingly produce spatio-temporal evidence chains that localize objects at specific timestamps. While these traces improve interpretability by grounding \emph{where} and \emph{when} evidence appears, they often leave the motion connecting observations, the \textit{how}, implicit. This makes dynamic and trajectory-dependent claims difficult to supervise, verify, or penalize when unsupported by the video. We formalize this missing component as Spatial-Temporal-Trajectory (STT) reasoning and introduce \textbf{Motion-o}, a motion-centric extension to vision-language models (VLMs) that makes trajectories explicit and verifiable. Motion-o augments evidence chains with Motion Chain of Thought (MCoT), a structured pathway that represents object motion through a discrete \texttt{<motion/>} tag summarizing direction, speed, and scale change. To supervise MCoT, we densify sparse spatio-temporal annotations into object tracks and derive motion descriptors from centroid displacement and box-area change. We then train with complementary rewards for trajectory consistency and visual grounding, including a perturbation-based signal that penalizes motion descriptions that remain unchanged when temporal evidence is removed. Across multiple video understanding benchmarks, Motion-o consistently improves trajectory-faithful reasoning without architectural modifications. These results suggest that an explicit motion interface can complement existing VLM pipelines by converting implicit dynamics into verifiable evidence. Code is available at~\href{https://github.com/ostadabbas/Motion-o}{\faGithub\ \texttt{ostadabbas/Motion-o}}.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Video Understanding | MVBench | Accuracy69.2 | 563 | |
| Video Understanding | VideoMME | Score (Overall)69.7 | 357 | |
| Video Understanding | WorldSense | Score41.5 | 25 | |
| Spatio-Temporal Reasoning | V-STAR (test) | What Accuracy64.1 | 15 | |
| Temporal Video Grounding | TVGBench (test) | mIoU39.6 | 10 | |
| Video Motion Reasoning | MotionBench (dev) | Overall Accuracy63 | 10 |