TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

About

Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. This limitation arises from MLLMs' context limit and training costs, which necessitate sparse frame sampling before videos are fed into MLLMs. However, building a trainable sampling method remains challenging because sparse frame sampling in Video-MLLMs is unsupervised and non-differentiable. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation to perform probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization of the temporal sampling policy. Furthermore, we propose a dual-style long video training data construction pipeline that balances comprehensive temporal understanding and key segment localization. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks and transfers across different cutting-edge Video-MLLMs. Our code is available at https://github.com/Hui-design/TSPO.
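
To make the two ideas in the abstract more concrete (probabilistic keyframe selection and group relative optimization), here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the Gumbel top-k sampler is a swapped-in, standard way to draw k distinct frames from a score distribution, the advantage normalization follows the usual GRPO-style recipe, and the function names `sample_keyframes`, `locating_reward`, and `group_relative_advantages`, as well as the toy scores and the target segment, are hypothetical.

```python
import math
import random
from statistics import mean, pstdev


def sample_keyframes(scores, k, rng):
    """Draw k distinct frame indices with the Gumbel top-k trick.

    Adding independent Gumbel noise to each logit and keeping the k largest
    values samples without replacement from the softmax over `scores`.
    (Assumption: the paper's temporal agent may sample differently.)
    """
    perturbed = []
    for idx, score in enumerate(scores):
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        perturbed.append((score + gumbel, idx))
    top_k = sorted(perturbed, reverse=True)[:k]
    return sorted(idx for _, idx in top_k)


def locating_reward(selected, segment):
    """Hypothetical rule-based reward: share of selected frames in [start, end]."""
    start, end = segment
    hits = sum(1 for idx in selected if start <= idx <= end)
    return hits / len(selected)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [0.1 * i for i in range(32)]  # toy per-frame relevance logits
    segment = (20, 31)                     # toy ground-truth key segment
    group = [sample_keyframes(scores, k=4, rng=rng) for _ in range(8)]
    rewards = [locating_reward(sel, segment) for sel in group]
    advantages = group_relative_advantages(rewards)
    for sel, r, a in zip(group, rewards, advantages):
        print(sel, round(r, 2), round(a, 2))
```

In a real training loop, each advantage would weight the log-probability of the corresponding frame selection in a policy-gradient update, and the reward would also include answer accuracy from the Video-MLLM; both are omitted here to keep the sketch self-contained.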

Canhui Tang, Zifan Han, Hongbo Sun, Sanping Zhou, Xuchong Zhang, Xin Wei, Ye Yuan, Huayu Zhang, Jinglin Xu, Hao Sun • 2025

Related benchmarks

Task	Dataset	Metric	Result
Long Video Understanding	Video-MME	Overall Score	59.6	90
Long-form Video Understanding	LVBench	Overall Score	46.4	77
Multi-robot cooperative Visual Question Answering	CoopSR iGibson	T1 Score	0.8611	25
Multi-robot cooperative Visual Question Answering	CoopSR Habitat	Success Rate (T1)	42.51	25
Long Video Understanding	LongVideoBench	Average Performance	58.6	24
Video Question Answering	LVBench (val)	Score	45.3	16
VideoQA	Video-MME	VQA Accuracy (Overall)	65.5	13
VideoQA	MLVU	Mean Score	76.3	12
VideoQA	LongVideoBench	Score (All Lengths)	63.9	10
