STEER: Structured Event Evidence for Video Reasoning via Multi-Objective Reinforcement Learning

About

Human understanding of video dynamics relies on forming structured representations of entities, actions, and temporal relations before engaging in abstract reasoning. In contrast, existing Video-LLMs apply unstructured chain-of-thought directly to raw visual tokens, where critical temporal cues are buried in verbose narration and event-level structure is largely overlooked. We propose Structured Event Evidence, which represents a video as a compact, time-ordered event schema capturing salient events with key attributes and inter-event temporal dependencies, enabling evidence-grounded reasoning through a constrained verification process. This design promotes concise, interpretable reasoning while reducing the drift typical of unconstrained chain-of-thought. To train models under this paradigm, we introduce STEER-60K, a dataset with a four-stage progressive pipeline: evidence training, format warm-start, thinking warm-start, and RL post-training. During RL, CoT length and task accuracy often conflict while rewards for hard samples are too sparse, causing the policy to neglect challenging instances. We formulate this as a multi-objective Pareto optimality problem and propose Pareto-Frontier guided Advantage Balancing (P-FAB), which dynamically resolves reward conflicts and identifies balanced optimization directions along the Pareto frontier. The resulting model, STEER-4B, rivals 7B-scale baselines on video understanding tasks with half the input frames. Code and data will be released.
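To make the Pareto framing concrete, the sketch below shows plain Pareto dominance over two competing rewards. It is not the paper's P-FAB algorithm, whose details the abstract does not give. The function name `pareto_front`, the two reward axes (an accuracy reward and a brevity reward for shorter CoT), and the sample values are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def pareto_front(points: List[Tuple[float, float]]) -> List[int]:
    """Return indices of non-dominated points (higher is better on both axes).

    A point is dominated if some other point is at least as good on both
    objectives and strictly better on at least one.
    """
    front = []
    for i, (a_i, b_i) in enumerate(points):
        dominated = any(
            (a_j >= a_i and b_j >= b_i) and (a_j > a_i or b_j > b_i)
            for j, (a_j, b_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical rollouts for one prompt: (accuracy reward, brevity reward).
rollouts = [(1.0, 0.2), (1.0, 0.6), (0.0, 0.9), (0.0, 0.5)]
front = pareto_front(rollouts)
print(front)  # indices of rollouts that no other rollout dominates
```

In this toy example the correct-but-verbose rollout and the wrong-but-mid-length rollout are dominated, while the remaining two represent different trade-offs between the objectives. A multi-objective method must then choose how to weight such trade-offs, which is the conflict the abstract describes.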

Zinuo Li, Yongxin Guo, Jun Liu, Jiawei Zhan, Xi Jiang, Chengjie Wang, Mohammed Bennamoun, Farid Boussaid, Feng Zheng, Qiuhong Ke • 2026

Related benchmarks

Task	Dataset	Metric	Result
Video Temporal Grounding	ActivityNet Captions	Recall @ IoU=0.3	69.8	47
Video Temporal Grounding	ActivityNet TimeLens	R@0.3	54.7	31
Video Temporal Grounding	Charades-TimeLens	R1@0.3	57.1	31
Video Question Answering	NExT-GQA	Accuracy	73.6	13
Video Understanding	ETBench	RAR Accuracy	53.4	10
Video Understanding	MLVU	TR Accuracy	80.6	4
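
For readers unfamiliar with the temporal grounding metrics in the table, the following sketch shows how Recall @ IoU=0.3 is commonly computed: a predicted segment counts as a hit when its temporal intersection-over-union with the ground-truth segment is at least 0.3. It assumes one prediction per query; the exact evaluation protocol of each benchmark may differ, and the function names and sample segments are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: List[Segment], gts: List[Segment],
                  threshold: float = 0.3) -> float:
    """Percentage of queries whose prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


# Hypothetical predictions and ground-truth segments for three queries.
preds = [(2.0, 8.0), (0.0, 3.0), (10.0, 20.0)]
gts = [(0.0, 10.0), (5.0, 9.0), (12.0, 18.0)]
recall = recall_at_iou(preds, gts, threshold=0.3)
print(f"R@0.3 = {recall:.1f}")
```

Here the first and third predictions overlap their targets with IoU 0.6 and the second misses entirely, so two of the three queries count as hits.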
