
How to Train Your Long-Context Visual Document Model

About

We present the first comprehensive, large-scale study of training long-context vision-language models at up to 344K tokens of context, targeting long-document visual question answering with measured transfer to long-context text. While several strong models of this kind are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. To bridge this gap, we systematically study continued pretraining, supervised finetuning, and preference optimization for 24B- and 32B-parameter models, backed by extensive long-context evaluations and ablations, and achieve state-of-the-art performance on MMLongBenchDoc at both parameter scales. Our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts; (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance; (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning; and (iv) we extend the known text-to-visual long-context transfer to the reverse direction, showing that visual long-context training transfers to long-context text performance. We also release MMLBD-C, a manually corrected version of MMLongBenchDoc that reduces the erroneous and low-quality examples in the benchmark.
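Finding (ii) above can be illustrated with a minimal sketch of prompt construction that interleaves explicit page-index markers with per-page content. The function name and marker format here are assumptions for illustration, not the paper's actual implementation; in a real vision-language pipeline the page content would be image tokens rather than text placeholders.

```python
# Hypothetical sketch: prefixing each page of a long document with its
# 1-based page index before appending the question, so the model can
# ground answers to specific pages.

def build_prompt_with_page_indices(pages, question):
    """Join per-page content, each prefixed with an explicit page index."""
    parts = []
    for i, page in enumerate(pages, start=1):
        parts.append(f"[Page {i}]\n{page}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)

prompt = build_prompt_with_page_indices(
    ["<page-1 tokens>", "<page-2 tokens>", "<page-3 tokens>"],
    "On which page is the revenue table?",
)
```

At evaluation time the same markers are included, so the indices seen during training and inference match.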

Austin Veselka • 2026

Related benchmarks

Task                                      Dataset        Result          Rank
Long-context Understanding                LongBench v2   -               37
Visual Question Answering                 SlideVQA       -               28
Document Understanding                    MMLBD-C        Accuracy 57.3   6
Long-context Document Understanding       MMLB 128K      Accuracy 75.6   6
Long-context Multimodal Understanding     HELMET         Accuracy 65.7   6
Long-document Visual Question Answering   VA             Accuracy 94.6   6
Long-document Visual Question Answering   LCA            Accuracy 93.1   6
Document Understanding                    DUDE           Accuracy 56     6
