DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning

About

Sports video understanding requires perceiving high-speed dynamics, complex rules, and long temporal contexts. Yet, current Multimodal Large Language Models (MLLMs) remain narrowly focused on single sports, specific tasks, or training-free paradigms. We introduce DeepSport, the first end-to-end trained MLLM for multi-task, multi-sport video understanding. DeepSport shifts from passive frame processing to active, iterative reasoning, dynamically extracting frames to "think with videos." To train our model, we curate a unified 78k-sample dataset via a rigorous three-step text-and-vision distillation pipeline. We then employ a progressive two-stage training strategy: a Sports Curriculum Supervised Fine-Tuning phase to build foundational perception, followed by Agentic Reinforcement Learning with a novel tool-use reward. Extensive experiments on a comprehensive 6.7k benchmark demonstrate that DeepSport achieves state-of-the-art performance, outperforming powerful proprietary and open-source models, while utilizing significantly fewer frames. Furthermore, it exhibits strong zero-shot transferability to unseen sports and broad motion recognition tasks, establishing a highly efficient and generalized foundation for complex video reasoning.

Junbo Zou, Haotian Xia, Zhen Ye, Shengjie Zhang, Christopher Lai, Vicente Ordonez, Weining Shen, Hanjie Chen • 2025

Related benchmarks

Task	Dataset	Result
Video Understanding	LVBench	Overall Accuracy: 32	95
Motion Understanding	MotionBench	Accuracy: 48.5	35
General Video Understanding	LongVideoBench	Accuracy: 45.9	24
Sports Video Understanding	DeepSport (test)	Fine-Grained Recognition Accuracy: 49.89	13
Action & Motion Recognition	DREAM 1k	F1 Score: 30.5	2
Action & Motion Recognition	ActionAtlas (unseen sports)	Accuracy: 27.2	2
General Video Understanding	VideoMME Long	Accuracy: 40.4	2
