ReMoT: Reinforcement Learning with Motion Contrast Triplets
About
We present ReMoT, a unified training paradigm that systematically addresses a fundamental shortcoming of VLMs: spatio-temporal consistency, a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) a rule-based automatic framework that generates ReMoT-16K, a large-scale motion-contrast dataset of 16.5K triplets derived from video meta-annotations, avoiding costly manual or model-based generation; and (2) Group Relative Policy Optimization (GRPO), which we empirically show yields the best performance and data efficiency for learning this contrastive reasoning, far surpassing standard Supervised Fine-Tuning. We also construct the first benchmark of fine-grained motion-contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art results on our new benchmark and on multiple standard VLM benchmarks, including a 25.1% improvement on spatio-temporal reasoning tasks.
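To illustrate the group-relative idea at the heart of GRPO, the sketch below normalizes each sampled response's reward against its group's statistics. This is a minimal, generic sketch: the binary reward scheme and group size are illustrative assumptions, not details taken from ReMoT.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# For each prompt, a group of G responses is sampled; each response's
# advantage is its reward normalized against the group's mean and std.
# The 0/1 rewards below are placeholders, not values from the paper.
from statistics import mean, stdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and std deviation."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers to one motion-contrast question,
# scored 1.0 if the predicted motion attribute is correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group rather than a learned value function, correct answers are reinforced exactly to the extent that they beat the group average, which is what makes the scheme data-efficient for contrastive reasoning.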
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMMU | Accuracy | 71.4 | 437 |
| Multimodal Understanding | MMStar | Accuracy | 70.4 | 324 |
| Multi-discipline Multimodal Understanding | MMMU (val) | -- | -- | 204 |
| Spatial Reasoning | VSI-Bench | Avg Score | 58.8 | 192 |
| Multimodal Understanding | MMMU (test) | -- | -- | 112 |
| Visual Perception | BLINK (val) | Validation Score | 62.15 | 29 |
| Motion Reasoning | ReMoT 16k (test) | Navigation: Camera (Overall) | 31.5 | 20 |
| LVLM Evaluation | MMStar | CP Score | 3.2 | 20 |
| Visual Reasoning | VLM2-Bench | Mat | 57.23 | 19 |
| Multi-modal Reasoning | MuirBench | Difference Reasoning Accuracy | 62.65 | 19 |