STEER: Structured Event Evidence for Video Reasoning via Multi-Objective Reinforcement Learning
About
Human understanding of video dynamics relies on forming structured representations of entities, actions, and temporal relations before engaging in abstract reasoning. In contrast, existing Video-LLMs apply unstructured chain-of-thought directly to raw visual tokens, where critical temporal cues are buried in verbose narration and event-level structure is largely overlooked. We propose Structured Event Evidence, which represents a video as a compact, time-ordered event schema capturing salient events with key attributes and inter-event temporal dependencies, enabling evidence-grounded reasoning through a constrained verification process. This design promotes concise, interpretable reasoning while reducing the drift typical of unconstrained chain-of-thought. To train models under this paradigm, we introduce STEER-60K, a dataset with a four-stage progressive pipeline: evidence training, format warm-start, thinking warm-start, and RL post-training. During RL, CoT length and task accuracy often conflict, while rewards for hard samples are too sparse, causing the policy to neglect challenging instances. We formulate this as a multi-objective Pareto optimality problem and propose Pareto-Frontier guided Advantage Balancing (P-FAB), which dynamically resolves reward conflicts and identifies balanced optimization directions along the Pareto frontier. The resulting model, STEER-4B, rivals 7B-scale baselines on video understanding tasks with half the input frames. Code and data will be released.
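To make the idea of a compact, time-ordered event schema concrete, the following is a minimal sketch of how such evidence might be represented and checked. It is not the released STEER format: the names `Event`, `TemporalRelation`, `build_evidence`, and `supports_before`, the field layout, and the three relation labels are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """One salient event with a few key attributes (hypothetical fields)."""
    event_id: str
    start: float  # seconds
    end: float    # seconds
    action: str
    entities: Tuple[str, ...] = ()


@dataclass(frozen=True)
class TemporalRelation:
    """A dependency between two events, e.g. e1 'before' e2."""
    source: str
    relation: str
    target: str


def relate(a: Event, b: Event) -> str:
    """Label the temporal relation of a with respect to b (assumed label set)."""
    if a.end <= b.start:
        return "before"
    if b.end <= a.start:
        return "after"
    return "overlaps"


def build_evidence(events: List[Event]) -> Tuple[List[Event], List[TemporalRelation]]:
    """Sort events by time and link each consecutive pair with a relation."""
    ordered = sorted(events, key=lambda e: (e.start, e.end))
    relations = [
        TemporalRelation(a.event_id, relate(a, b), b.event_id)
        for a, b in zip(ordered, ordered[1:])
    ]
    return ordered, relations


def supports_before(events: List[Event], first_action: str, second_action: str) -> bool:
    """Check the claim 'first_action ends before second_action starts'
    against the evidence only, without consulting any free-form narration."""
    by_action = {e.action: e for e in events}
    if first_action not in by_action or second_action not in by_action:
        return False  # claim is not grounded in the evidence
    return relate(by_action[first_action], by_action[second_action]) == "before"


events = [
    Event("e1", 0.0, 2.0, "open fridge", ("person", "fridge")),
    Event("e2", 5.0, 8.0, "pour milk", ("person", "milk", "glass")),
    Event("e3", 1.5, 4.0, "take carton", ("person", "milk")),
]
ordered, relations = build_evidence(events)
```

In this sketch, an answer is accepted only if every temporal claim it makes can be verified against the schema; a claim about an action that is absent from the evidence is rejected rather than guessed.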
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Temporal Grounding | ActivityNet Captions | Recall @ IoU=0.5 | 48.4 | 43 |
| Video Temporal Grounding | ActivityNet TimeLens | R@0.3 | 54.7 | 31 |
| Video Temporal Grounding | Charades-TimeLens | R1@0.3 | 57.1 | 31 |
| Video Question Answering | NExT-GQA | Accuracy | 73.6 | 13 |
| Video Understanding | ETBench | RAR Accuracy | 53.4 | 10 |
| Video Understanding | MLVU | TR Accuracy | 80.6 | 4 |