SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

About

As humans, we are natural any-horizon reasoners, i.e., we can decide whether to iteratively skim long videos or watch short ones in full when necessary for a given task. With this in mind, one would expect video reasoning models to reason flexibly across different durations. However, SOTA models are still trained to predict answers in a single turn while processing a large number of frames, akin to watching an entire long video, requiring significant resources. This raises the question: Is it possible to develop performant any-horizon video reasoning systems? Inspired by human behavior, we first propose SAGE, an agent system that performs multi-turn reasoning on long videos while handling simpler problems in a single turn. Secondly, we introduce an easy synthetic data generation pipeline using Gemini-2.5-Flash to train the orchestrator, SAGE-MM, which lies at the core of SAGE. We further propose an effective RL post-training recipe essential for instilling any-horizon reasoning ability in SAGE-MM. Thirdly, we curate SAGE-Bench with an average duration of greater than 700 seconds for evaluating video reasoning ability in real-world entertainment use cases. Lastly, we empirically validate the effectiveness of our system, data, and RL recipe, observing notable improvements of up to 6.1% on open-ended video reasoning tasks, as well as an impressive 8.2% improvement on videos longer than 10 minutes.

Jitesh Jain, Jialuo Li, Zixian Ma, Jieyu Zhang, Chris Dongjoo Kim, Sangho Lee, Rohun Tripathi, Tanmay Gupta, Christopher Clark, Humphrey Shi • 2025

Related benchmarks

Task	Dataset	Result
Video Understanding	MLVU	Accuracy 49.7	147
Video Understanding	LongVideoBench	Accuracy 37.4	128
Video Understanding	LVBench	Overall Accuracy 31.8	95
Video Understanding	MMVU	Accuracy 55.7	91
Temporal Grounding	Charades-STA (test)	--	68
Video Understanding	Video-MME w/o sub	Accuracy 44.1	46
Video Reasoning	SAGE-Bench 1.0 (test)	Overall Score 73.4	29
Video Understanding	VideoMME w/ sub	Accuracy 52.4	15
Video Reasoning	MINERVA overall (test)	Accuracy 32.9	8
Video Reasoning	MINERVA 600+s (test)	Accuracy 29	8

