Unleashing Hour-Scale Video Training for Long Video-Language Understanding

About

Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LMMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates question-relevant and spatiotemporally informative semantics from the cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple representative long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.

Jingyang Lin, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Xiaodong Yu, Hao Chen, Jiebo Luo, Zicheng Liu, Emad Barsoum • 2025
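
To give a rough sense of the scale described in the abstract, the sketch below works out how many frames 1-FPS sampling yields across the stated 3 to 60 minute duration range, and the approximate QA density of the dataset. The helper name `frames_at_fps` is hypothetical and not taken from the paper. The density figure is derived from the rounded totals in the abstract ("around 9,700 hours", "3.3M" QA pairs), so treat it as an estimate only.

```python
def frames_at_fps(duration_minutes: float, fps: float = 1.0) -> int:
    """Number of frames sampled from a video of the given length (hypothetical helper)."""
    return int(round(duration_minutes * 60 * fps))


# Duration range stated in the abstract: 3 to 60 minutes per video.
for minutes in (3, 30, 60):
    print(f"{minutes:>2} min at 1 FPS -> {frames_at_fps(minutes)} frames")

# Rounded dataset totals from the abstract; the ratio is only approximate.
total_hours = 9_700
total_qa_pairs = 3_300_000
qa_per_hour = total_qa_pairs / total_hours
print(f"roughly {qa_per_hour:.0f} QA pairs per hour of video")
```

An hour-long video at 1 FPS therefore contributes 3,600 frames, which is the context length the cached full video context has to cover before any question-relevant selection takes place.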

Related benchmarks

Task	Dataset	Result
Long Video Understanding	LVBench	Accuracy 45.6	267
Long-form Video Understanding	LongVideoBench	Accuracy 60.4	135
Long Video Understanding	VideoMME	Accuracy 63.6	97
Long Video Understanding	Video-MME Overall	Accuracy 63.6	81
Long Video Understanding	VideoMME Long (3~120 min)	Score 55	22
Long Video Understanding	LongVideoBench 0~60 min	Score 60.4	17
Long Video Understanding	LVBench 4101 sec	Score 45.6	10

