Temporal Preference Optimization for Long-Form Video Understanding

About

Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks--LongVideoBench, MLVU, and Video-MME--demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding. Project page: https://ruili33.github.io/tpo_website.

Rui Li, Xiaohan Wang, Yuhui Zhang, Orr Zohar, Zeyu Wang, Serena Yeung-Levy • 2025
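
The abstract describes preference learning over pairs of well-grounded and less accurate temporal responses, but this page does not state the exact training objective. As a rough illustration only, the sketch below assumes a DPO-style pairwise loss; the `PreferencePair` record, the `dpo_style_loss` function, and the `beta` parameter are hypothetical names introduced here, not taken from the paper or its code.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """Hypothetical record for one preference example (not the paper's schema)."""
    question: str
    preferred: str      # response judged better grounded in time
    dispreferred: str   # less accurate temporal response
    granularity: str    # "localized" or "comprehensive", as named in the abstract


def dpo_style_loss(policy_logp_pref: float,
                   policy_logp_dispref: float,
                   ref_logp_pref: float,
                   ref_logp_dispref: float,
                   beta: float = 0.1) -> float:
    """Assumed DPO-style loss for one pair: -log(sigmoid(margin)).

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    margin = beta * ((policy_logp_pref - ref_logp_pref)
                     - (policy_logp_dispref - ref_logp_dispref))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


pair = PreferencePair(
    question="What happens after the person opens the door?",
    preferred="They walk into the kitchen.",
    dispreferred="They leave the house.",
    granularity="localized",
)
# Illustrative log-probabilities only; real values would come from the models.
loss = dpo_style_loss(-12.0, -15.0, -13.0, -13.5)
```

Under this assumption, the loss equals log 2 when the policy and the reference model agree, and it falls as the policy assigns relatively more probability to the preferred response than the reference does.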

Related benchmarks

Task                               Dataset                                   Metric             Result  Rank
Long Video Understanding           LongVideoBench                            Score              60.1    248
Long Video Understanding           MLVU                                      Score              71.1    154
Video Hallucination Evaluation     VideoHallucer                             Overall Score      51.3    35
Grounded Video Question Answering  NExT-GQA (test)                           mIoU               27.7    32
Long Video Understanding           Video-MME                                 Overall Score      65.6    30
Conventional Video Understanding   VideoMMe, MVBench                         VideoMMe Score     53      17
Temporal Understanding             TempCompass, TVBench                      TempCompass Score  0.695   17
Hallucination Examination          VidHalluc, VideoHallucer, EventHallusion  VidHalluc Score    64.4    17
Hallucination Examination          VidHalluc                                 BQA                74.85   15
Hallucination Examination          EventHallusion                            Average Score      63.33   15