ReMoT: Reinforcement Learning with Motion Contrast Triplets
About
We present ReMoT, a unified training paradigm that systematically addresses a fundamental shortcoming of VLMs: spatio-temporal consistency, a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) a rule-based automatic framework that generates ReMoT-16K, a large-scale motion-contrast dataset of 16.5K triplets derived from video meta-annotations, avoiding costly manual or model-based generation; and (2) Group Relative Policy Optimization (GRPO), which we empirically show yields the best performance and data efficiency for learning this contrastive reasoning, far surpassing standard Supervised Fine-Tuning. We also construct the first benchmark of fine-grained motion-contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art results on our new benchmark and on multiple standard VLM benchmarks, including a 25.1% improvement on spatio-temporal reasoning tasks.
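To illustrate the group-relative idea at the heart of GRPO, the sketch below normalizes each sampled response's reward against its group's statistics. This is a minimal, generic sketch: the binary reward scheme and group size are illustrative assumptions, not details taken from ReMoT.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# For each prompt, a group of G responses is sampled; each response's
# advantage is its reward normalized against the group's mean and std.
# The 0/1 rewards below are placeholders, not values from the paper.
from statistics import mean, stdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and std deviation."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers to one motion-contrast question,
# scored 1.0 if the predicted motion attribute is correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group rather than a learned value function, correct answers are reinforced exactly to the extent that they beat the group average, which is what makes the scheme data-efficient for contrastive reasoning.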
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMMU | Accuracy | 71.4 | 437 |
| Multimodal Understanding | MMStar | Accuracy | 70.4 | 324 |
| Multi-discipline Multimodal Understanding | MMMU (val) | -- | -- | 204 |
| Spatial Reasoning | VSI-Bench | Avg Score | 58.8 | 192 |
| Multimodal Understanding | MMMU (test) | -- | -- | 112 |
| Visual Perception | BLINK (val) | Validation Score | 62.15 | 29 |
| Motion Reasoning | ReMoT 16k (test) | Navigation: Camera (Overall) | 31.5 | 20 |
| LVLM Evaluation | MMStar | CP Score | 3.2 | 20 |
| Visual Reasoning | VLM2-Bench | Mat | 57.23 | 19 |
| Multi-modal Reasoning | MuirBench | Difference Reasoning Accuracy | 62.65 | 19 |