
Motion-o: Trajectory-Grounded Video Reasoning

About

Recent video reasoning models increasingly produce spatio-temporal evidence chains that localize objects at specific timestamps. While these traces improve interpretability by grounding where and when evidence appears, they often leave the motion connecting observations, the how, implicit. This makes dynamic and trajectory-dependent claims difficult to supervise, verify, or penalize when unsupported by the video. We formalize this missing component as Spatial-Temporal-Trajectory (STT) reasoning and introduce Motion-o, a motion-centric extension to vision-language models (VLMs) that makes trajectories explicit and verifiable. Motion-o augments evidence chains with Motion Chain of Thought (MCoT), a structured pathway that represents object motion through a discrete <motion/> tag summarizing direction, speed, and scale change. To supervise MCoT, we densify sparse spatio-temporal annotations into object tracks and derive motion descriptors from centroid displacement and box-area change. We then train with complementary rewards for trajectory consistency and visual grounding, including a perturbation-based signal that penalizes motion descriptions that remain unchanged when temporal evidence is removed. Across multiple video understanding benchmarks, Motion-o consistently improves trajectory-faithful reasoning without architectural modifications. These results suggest that an explicit motion interface can complement existing VLM pipelines by converting implicit dynamics into verifiable evidence. Code is available at https://github.com/ostadabbas/Motion-o.
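
To make the descriptor step concrete, here is a minimal Python sketch of how direction, speed, and scale change could be derived from centroid displacement and box-area change between two observations of one tracked object. It is an illustration under stated assumptions, not the authors' implementation: the `Box` class, the function names, the threshold values, the label vocabulary, and the attribute layout of the `<motion/>` tag are all hypothetical, since the abstract only states that the tag summarizes direction, speed, and scale change.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical box observation: time t in seconds, corners in normalized [0, 1] image coordinates."""
    t: float
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def centroid(self):
        return ((self.x1 + self.x2) / 2.0, (self.y1 + self.y2) / 2.0)

    @property
    def area(self):
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def motion_descriptor(a, b, still_eps=0.01, fast_thresh=0.25, scale_eps=0.1):
    """Discretize the motion between observations a and b (thresholds are assumed, not from the paper)."""
    dt = b.t - a.t
    if dt <= 0:
        raise ValueError("observation b must come after observation a")

    (ax, ay), (bx, by) = a.centroid, b.centroid
    dx, dy = bx - ax, by - ay
    dist = math.hypot(dx, dy)

    if dist < still_eps:
        direction, speed = "none", "static"
    else:
        # Image coordinates: y grows downward, so positive dy means "down".
        if abs(dx) >= abs(dy):
            direction = "right" if dx > 0 else "left"
        else:
            direction = "down" if dy > 0 else "up"
        speed = "fast" if dist / dt >= fast_thresh else "slow"

    # Box-area ratio as a proxy for scale change (e.g. approaching or receding).
    ratio = b.area / a.area if a.area > 0 else float("inf")
    if ratio > 1 + scale_eps:
        scale = "growing"
    elif ratio < 1 - scale_eps:
        scale = "shrinking"
    else:
        scale = "stable"

    return {"direction": direction, "speed": speed, "scale": scale}


def to_motion_tag(desc):
    """Serialize a descriptor as a self-closing tag; this attribute layout is an assumption."""
    return '<motion dir="{direction}" speed="{speed}" scale="{scale}"/>'.format(**desc)


if __name__ == "__main__":
    start = Box(t=0.0, x1=0.1, y1=0.4, x2=0.2, y2=0.5)
    end = Box(t=1.0, x1=0.5, y1=0.4, x2=0.7, y2=0.6)
    print(to_motion_tag(motion_descriptor(start, end)))
```

Discretizing into a few labels, rather than emitting raw displacements, keeps the tag easy to compare against labels derived from the tracks, which is one plausible way a trajectory-consistency reward could check it.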

Bishoy Galoaa, Shayda Moezzi, Xiangyu Bai, Sarah Ostadabbas • 2026

Related benchmarks

Task                       Dataset             Metric            Result  Rank
Video Understanding        MVBench             Accuracy          69.2    563
Video Understanding        VideoMME            Score (Overall)   69.7    357
Video Understanding        WorldSense          Score             41.5    25
Spatio-Temporal Reasoning  V-STAR (test)       What Accuracy     64.1    15
Temporal Video Grounding   TVGBench (test)     mIoU              39.6    10
Video Motion Reasoning     MotionBench (dev)   Overall Accuracy  63      10
