Motion-o: Trajectory-Grounded Video Reasoning

About

Recent video reasoning models increasingly produce spatio-temporal evidence chains that localize objects at specific timestamps. While these traces improve interpretability by grounding where and when evidence appears, they often leave the motion connecting observations, the how, implicit. This makes dynamic and trajectory-dependent claims difficult to supervise, verify, or penalize when unsupported by the video. We formalize this missing component as Spatial-Temporal-Trajectory (STT) reasoning and introduce Motion-o, a motion-centric extension to vision-language models (VLMs) that makes trajectories explicit and verifiable. Motion-o augments evidence chains with Motion Chain of Thought (MCoT), a structured pathway that represents object motion through a discrete <motion/> tag summarizing direction, speed, and scale change. To supervise MCoT, we densify sparse spatio-temporal annotations into object tracks and derive motion descriptors from centroid displacement and box-area change. We then train with complementary rewards for trajectory consistency and visual grounding, including a perturbation-based signal that penalizes motion descriptions that remain unchanged when temporal evidence is removed. Across multiple video understanding benchmarks, Motion-o consistently improves trajectory-faithful reasoning without architectural modifications. These results suggest that an explicit motion interface can complement existing VLM pipelines by converting implicit dynamics into verifiable evidence. Code is available at https://github.com/ostadabbas/Motion-o (ostadabbas/Motion-o).
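As a rough illustration of how motion descriptors might be derived from centroid displacement and box-area change, the sketch below turns a short object track into a <motion/> tag. It is not the authors' implementation: the Box type, the describe_motion helper, the attribute names (dir, speed, scale), the label vocabulary, and every threshold are assumptions chosen for this example, and box coordinates are assumed to be normalized to [0, 1].

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """One track entry: timestamp (seconds) plus a normalized xyxy box."""
    t: float
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def centroid(self):
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)

    @property
    def area(self):
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def describe_motion(track, still_eps=0.02, fast_thresh=0.25, scale_eps=0.1):
    """Summarize a time-sorted track as a discrete motion tag.

    All thresholds are hypothetical: still_eps is the minimum centroid
    displacement counted as motion, fast_thresh is the speed (normalized
    units per second) separating "slow" from "fast", and scale_eps is the
    relative area change treated as a scale change.
    """
    first, last = track[0], track[-1]
    (cx0, cy0), (cx1, cy1) = first.centroid, last.centroid
    dx, dy = cx1 - cx0, cy1 - cy0
    dist = math.hypot(dx, dy)
    dt = last.t - first.t
    speed_val = dist / dt if dt > 0 else 0.0

    # Direction from the dominant axis of centroid displacement.
    if dist < still_eps:
        direction = "stationary"
    elif abs(dx) >= abs(dy):
        direction = "right" if dx > 0 else "left"
    else:
        direction = "down" if dy > 0 else "up"  # image y grows downward

    # Speed bucket from displacement per second.
    if dist < still_eps:
        speed = "static"
    elif speed_val < fast_thresh:
        speed = "slow"
    else:
        speed = "fast"

    # Scale change from the ratio of last to first box area.
    ratio = last.area / first.area if first.area > 0 else 1.0
    if ratio > 1 + scale_eps:
        scale = "growing"
    elif ratio < 1 - scale_eps:
        scale = "shrinking"
    else:
        scale = "stable"

    return f'<motion dir="{direction}" speed="{speed}" scale="{scale}"/>'


track = [
    Box(t=0.0, x1=0.1, y1=0.4, x2=0.2, y2=0.6),
    Box(t=1.0, x1=0.6, y1=0.4, x2=0.8, y2=0.6),
]
print(describe_motion(track))
```

Using only the first and last boxes keeps the sketch short. A densified track would also allow per-segment descriptors, which is one plausible way to cover objects that change direction partway through a clip.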

Bishoy Galoaa, Shayda Moezzi, Xiangyu Bai, Sarah Ostadabbas • 2026

Related benchmarks

Task	Dataset	Metric	Result
Video Understanding	MVBench	Accuracy	69.2	563
Video Understanding	VideoMME	Score (Overall)	69.7	357
Video Understanding	WorldSense	Score	41.5	25
Spatio-Temporal Reasoning	V-STAR (test)	What Accuracy	64.1	15
Temporal Video Grounding	TVGBench (test)	mIoU	39.6	10
Video Motion Reasoning	MotionBench (dev)	Overall Accuracy	63	10

