
ReMoT: Reinforcement Learning with Motion Contrast Triplets

About

We present ReMoT, a unified training paradigm that systematically addresses a fundamental shortcoming of VLMs -- spatio-temporal consistency, a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) a rule-based automatic framework that generates ReMoT-16K, a large-scale motion-contrast dataset of 16.5K triplets derived from video meta-annotations, avoiding costly manual or model-based generation; and (2) Group Relative Policy Optimization (GRPO), which we empirically show yields the best performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark of fine-grained motion-contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and on multiple standard VLM benchmarks, including a 25.1% performance gain on spatio-temporal reasoning tasks.
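The GRPO step the abstract credits boils down to normalizing each sampled response's reward against the statistics of its sampling group, rather than against a learned value baseline. A minimal sketch of that group-relative advantage computation, assuming a simple binary correctness reward (the paper's actual reward rules for motion-contrast triplets are not given here):

```python
# Hedged sketch of the group-relative advantage at the core of GRPO.
# The binary reward and group size below are illustrative assumptions,
# not ReMoT's actual reward design.
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """For one prompt, normalize each sampled response's reward
    against its group: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: a group of 4 sampled answers to one motion-contrast question,
# rewarded 1.0 when the model picks the correct contrast, else 0.0.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

These advantages then weight the policy-gradient update per token, so responses that beat their own group's average are reinforced and the rest are suppressed, with no critic network required.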

Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multimodal Understanding | MMMU | Accuracy | 71.4 | 437 |
| Multimodal Understanding | MMStar | Accuracy | 70.4 | 324 |
| Multi-discipline Multimodal Understanding | MMMU (val) | -- | -- | 204 |
| Spatial Reasoning | VSI-Bench | Avg Score | 58.8 | 192 |
| Multimodal Understanding | MMMU (test) | -- | -- | 112 |
| Visual Perception | BLINK (val) | Validation Score | 62.15 | 29 |
| Motion Reasoning | ReMoT 16k (test) | Navigation: Camera (Overall) | 31.5 | 20 |
| LVLM Evaluation | MMStar | CP Score | 3.2 | 20 |
| Visual Reasoning | VLM2-Bench | Mat | 57.23 | 19 |
| Multi-modal Reasoning | MuirBench | Difference Reasoning Accuracy | 62.65 | 19 |
Showing 10 of 15 rows
