Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

About

Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training (LongPT) for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.

Zhaowei Wang, Lishu Luo, Haodong Duan, Weiwei Liu, Sijin Wu, Ji Luo, Shen Yan, Shuai Peng, Sihang Yuan, Chaoyi Huang, Yi Lin, Yangqiu Song • 2026

Related benchmarks

Task	Dataset	Result
Long Video Understanding	LongVideoBench	--	290
Long Video Understanding	MLVU	--	265
Long Video Understanding	Video-MME	Overall Score 67.78	90
Long-document Visual Question Answering	MMLongBench 128K context	MMLB-D 34.19	22
Long-document Visual Question Answering	MMLongBench Overall	Average Score 57.7	22
Long-document Visual Question Answering	MMLongBench 64K context	MMLB-D 36	22
Long-context Multi-modal Understanding	MM-NIAH 128K	Retrieval Score 57.83	6
Multi-modal Needle-In-A-Haystack	MM-NIAH 64K	Retrieval Score (Ret.) 74.83	6
Video-Text Compression Evaluation	VTCBench-Wild	Retrieval Score 91.75	6
Long-document Visual Question Answering	MMLongBench 512K context	MMLongBench-D Score 31.91	4

Showing 10 of 11 rows
