InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

About

This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling. To this end, we develop InternVideo2.5, with a focus on enhancing the original MLLM's ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations into MLLMs using direct preference optimization and develops compact spatiotemporal representations through adaptive hierarchical token compression. Experimental results demonstrate that this LRC design greatly improves the results of video MLLMs on mainstream video understanding benchmarks (short and long), enables the MLLM to memorize significantly longer video inputs (at least 6x longer than the original), and lets it master specialized vision capabilities such as object tracking and segmentation. Our work highlights the importance of multimodal context richness (length and fineness) in empowering an MLLM's innate abilities (focus and memory), providing new insights for future research on video MLLMs. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5
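
For readers unfamiliar with direct preference optimization, the standard per-pair DPO objective can be sketched as below. This is a minimal illustration of the generic DPO loss, not the InternVideo2.5 training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example, and the abstract does not say how the dense vision task annotations are turned into preference pairs.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) response pair.

    Inputs are summed log-probabilities of each response under the
    trainable policy and under a frozen reference model.
    All names and the beta default are hypothetical, for illustration only.
    """
    # How much more the policy prefers the chosen response than the
    # reference model does, relative to the rejected response.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # loss = -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss shrinks as the policy assigns relatively more probability to the preferred response than the reference model does, and it equals log 2 when policy and reference agree exactly.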

Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, Min Dou, Kai Chen, Wenhai Wang, Yu Qiao, Yali Wang, Limin Wang • 2025

Related benchmarks

Task	Dataset	Result
Video Understanding	MVBench	--	635
Visual Object Tracking	LaSOT (test)	--	470
Visual Object Tracking	GOT-10k (test)	--	461
Long Video Understanding	LongVideoBench	Score 60.6	290
Long Video Understanding	LongVideoBench (val)	Accuracy 63.2	282
Long Video Understanding	LVBench	Accuracy 0.464	267
Long Video Understanding	MLVU	--	265
Video Question Answering	LongVideoBench	Accuracy 60.6	224
Video Understanding	VideoMME	--	222
Video Question Answering	MLVU	Accuracy 72.8	213

Showing 10 of 86 rows
