VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting
About
Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during the reasoning process. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which increase annotation or computational cost. To address this challenge, we propose VisionCoach, an input-adaptive RL framework that improves spatio-temporal grounding through visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at inference. VisionCoach consists of two components: (1) a Visual Prompt Selector, which predicts appropriate prompt types conditioned on the video and question, and (2) a Spatio-Temporal Reasoner, optimized with RL under visual prompt guidance and object-aware grounding rewards that enforce object identity consistency and multi-region bounding-box overlap. Extensive experiments demonstrate that VisionCoach achieves state-of-the-art performance under comparable settings across diverse video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, WorldSense, VideoMMMU, PerceptionTest, and Charades-STA), while maintaining a single efficient inference pathway without external tools. Our results show that visual prompting during training improves grounded video reasoning, while self-distillation enables the model to internalize this ability without requiring prompts at inference time.
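The abstract does not spell out how the object-aware grounding reward is computed. As a rough illustration only, the sketch below assumes one plausible form: predicted and reference boxes are keyed by object label, each reference object contributes the IoU of its matching prediction, and an object whose label is missing from the prediction contributes zero. The names `box_iou` and `grounding_reward`, the label-keyed dictionaries, and the averaging scheme are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple

# Axis-aligned box as (x1, y1, x2, y2); this layout is an assumption.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred: Dict[str, Box], ref: Dict[str, Box]) -> float:
    """Hypothetical reward: mean IoU over reference objects.

    Identity consistency is approximated by matching on the object label,
    so a reference object absent from `pred` scores 0. Averaging over all
    reference objects stands in for multi-region overlap.
    """
    if not ref:
        return 0.0
    total = sum(box_iou(pred[name], box) for name, box in ref.items() if name in pred)
    return total / len(ref)
```

Under these assumptions, a prediction that localizes one of three reference objects perfectly and misses the other two would receive a reward of one third, so the signal penalizes both wrong identities and poor localization.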
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Understanding | VideoMME | Score (Overall): 63.3 | 357 |
| Temporal Grounding | Charades-STA | mIoU: 42.7 | 107 |
| Video Understanding | VideoMMMU | -- | 59 |
| Video Understanding | WorldSense | Score: 43.8 | 25 |
| Long Video Reasoning | LongVideoReason (eval) | Accuracy: 70.7 | 20 |
| Temporal Video Grounding | TVGBench | mIoU: 21 | 13 |
| Spatio-Temporal Grounding | V-Star | Accuracy: 61.2 | 12 |
| Video Understanding | WorldSense (test) | Overall Accuracy: 39.2 | 8 |
| Video Reasoning | Video-MME v2 | Non-Linear Score: 8.6 | 6 |
| Video Understanding | PerceptionTest | Overall Score: 68.7 | 5 |