
STEER: Structured Event Evidence for Video Reasoning via Multi-Objective Reinforcement Learning

About

Human understanding of video dynamics relies on forming structured representations of entities, actions, and temporal relations before engaging in abstract reasoning. In contrast, existing Video-LLMs apply unstructured chain-of-thought directly to raw visual tokens, where critical temporal cues are buried in verbose narration and event-level structure is largely overlooked. We propose Structured Event Evidence, which represents a video as a compact, time-ordered event schema capturing salient events with key attributes and inter-event temporal dependencies, enabling evidence-grounded reasoning through a constrained verification process. This design promotes concise, interpretable reasoning while reducing the drift typical of unconstrained chain-of-thought. To train models under this paradigm, we introduce STEER-60K, a dataset with a four-stage progressive training pipeline: evidence training, format warm-start, thinking warm-start, and RL post-training. During RL, CoT length and task accuracy often conflict, while rewards for hard samples are too sparse, causing the policy to neglect challenging instances. We formulate this as a multi-objective Pareto optimality problem and propose Pareto-Frontier guided Advantage Balancing (P-FAB), which dynamically resolves reward conflicts and identifies balanced optimization directions along the Pareto frontier. The resulting model, STEER-4B, rivals 7B-scale baselines on video understanding tasks with half the input frames. Code and data will be released.
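
To make the Pareto framing concrete, the sketch below shows plain Pareto dominance over two reward objectives. It is not P-FAB itself, whose details are not given here. The reward pairs, the "brevity" objective standing in for CoT length, and the function names are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Tuple[float, ...]]) -> List[int]:
    """Return indices of points that no other point dominates (higher is better)."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical (accuracy_reward, brevity_reward) pairs for four sampled responses.
rewards = [(1.0, 0.2), (1.0, 0.6), (0.0, 0.9), (0.0, 0.5)]
front = pareto_front(rewards)
print(front)  # indices of the non-dominated responses
```

Responses on the front are those where neither objective can improve without the other getting worse, which is the trade-off a multi-objective method has to balance.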

Zinuo Li, Yongxin Guo, Jun Liu, Jiawei Zhan, Xi Jiang, Chengjie Wang, Mohammed Bennamoun, Farid Boussaid, Feng Zheng, Qiuhong Ke • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Video Temporal Grounding | ActivityNet Captions | Recall @ IoU=0.5 | 48.4 | 43
Video Temporal Grounding | ActivityNet TimeLens | R@0.3 | 54.7 | 31
Video Temporal Grounding | Charades-TimeLens | R1@0.3 | 57.1 | 31
Video Question Answering | NExT-GQA | Accuracy | 73.6 | 13
Video Understanding | ETBench | RAR Accuracy | 53.4 | 10
Video Understanding | MLVU | TR Accuracy | 80.6 | 4
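
The temporal grounding rows report recall at an IoU threshold. As a rough illustration of how such a metric is commonly computed, the sketch below assumes one top-1 predicted segment per query, scored against a single ground-truth segment. The exact protocol of each benchmark is not stated here, and the segments are made-up values.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: List[Segment], gts: List[Segment], threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


# Hypothetical predictions and ground truths for two queries.
preds = [(2.0, 8.0), (10.0, 12.0)]
gts = [(4.0, 10.0), (20.0, 25.0)]
print(recall_at_iou(preds, gts, threshold=0.5))
```

Under this reading, a lower threshold such as 0.3 is more lenient than 0.5, so scores at different thresholds are not directly comparable.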