Towards Temporal Compositional Reasoning in Long-Form Sports Videos

About

Sports videos are a challenging domain for multimodal understanding because they involve complex and dynamic human activities. Despite rapid progress in Multimodal Large Language Models (MLLMs), long-horizon reasoning in sports videos remains difficult, as answering questions requires both locating temporally sparse evidence and integrating it into reasoning. We attribute this limitation to two closely coupled factors: insufficient supervision over temporally dispersed evidence, and the lack of methods that require models to identify, localize, and justify temporal evidence. To address these gaps, we introduce SportsTime, a large-scale benchmark for long-form sports video understanding, comprising 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations. Building on SportsTime, we propose Chain-of-Time Reasoning (CoTR), which treats reasoning as a process of temporally grounded evidence composition. Specifically, during training, CoTR introduces a temporal-reward GRPO to encourage temporally grounded reasoning. During inference, it employs an anchor-observe-infer evidence-seeking loop to iteratively localize, verify, and compose temporal evidence before producing the final answer. Experiments demonstrate the usefulness of SportsTime as a benchmark and the effectiveness of CoTR, which consistently improves temporal compositional reasoning and step-wise grounding quality over strong MLLM baselines.
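As a rough illustration of what a temporal reward inside GRPO might look like, the sketch below scores predicted evidence intervals against annotated ones and then normalizes rewards within a group of sampled responses. The abstract does not specify the reward, so the temporal IoU choice, the step alignment by position, and all function names here are assumptions for illustration, not the authors' implementation. Only the group-relative normalization is the standard GRPO step.

```python
from statistics import mean, pstdev

def temporal_iou(pred, gold):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_reward(pred_steps, gold_steps):
    """Assumed reward: mean IoU over position-aligned reasoning steps.

    Gold steps with no matching prediction contribute 0; predicted steps
    beyond the number of gold steps are ignored.
    """
    if not gold_steps:
        return 0.0
    scores = [temporal_iou(p, g) for p, g in zip(pred_steps, gold_steps)]
    return sum(scores) / len(gold_steps)

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical example: two sampled responses for one question.
gold = [(12.0, 18.0), (95.0, 104.0)]
samples = [
    [(12.0, 18.0), (95.0, 104.0)],  # both evidence spans localized exactly
    [(40.0, 50.0), (95.0, 104.0)],  # first span misplaced
]
rewards = [temporal_reward(s, gold) for s in samples]
advantages = group_relative_advantages(rewards)
```

In this sketch, the response that localizes both evidence spans receives the higher reward and therefore a positive advantage within its group, which is the sense in which such a reward would encourage temporally grounded reasoning.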

Siyu Cao, Lu Zhang, Ruizhe Zeng, Zhi-yong Liu • 2026

Related benchmarks

Task	Dataset	Result
Video Understanding	LVBench	Average Score: 59.9	75
Video Understanding	MLVU	Score: 76.2	24
Sports Video Question Answering	SportsTime	Perception Score: 25.12	17
Sports Video Question Answering	SportsTime 200 stratified samples (test)	Human Average Score: 30.5	8
