
How to Train Your Long-Context Visual Document Model

About

We present the first comprehensive, large-scale study of training long-context vision-language models at up to 344K tokens of context, targeting long-document visual question answering with measured transfer to long-context text. While several strong models of this kind are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. To bridge this gap, we systematically study continued pretraining, supervised finetuning, and preference optimization for 24B- and 32B-parameter models, backed by extensive long-context evaluations and ablations, and achieve state-of-the-art performance on MMLongBenchDoc at both parameter scales. Our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts; (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance; (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning; and (iv) we extend the known text-to-visual long-context transfer to the reverse direction, showing that visual long-context training transfers to long-context text performance. We also release MMLBD-C, a manually corrected version of MMLongBenchDoc that reduces the erroneous and low-quality examples in the benchmark.
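Finding (ii) above can be illustrated with a minimal sketch of prompt construction that interleaves explicit page-index markers with per-page content. The function name and marker format here are assumptions for illustration, not the paper's actual implementation; in a real vision-language pipeline the page content would be image tokens rather than text placeholders.

```python
# Hypothetical sketch: prefixing each page of a long document with its
# 1-based page index before appending the question, so the model can
# ground answers to specific pages.

def build_prompt_with_page_indices(pages, question):
    """Join per-page content, each prefixed with an explicit page index."""
    parts = []
    for i, page in enumerate(pages, start=1):
        parts.append(f"[Page {i}]\n{page}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)

prompt = build_prompt_with_page_indices(
    ["<page-1 tokens>", "<page-2 tokens>", "<page-3 tokens>"],
    "On which page is the revenue table?",
)
```

At evaluation time the same markers are included, so the indices seen during training and inference match.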

Austin Veselka • 2026

Related benchmarks

Task                                      Dataset        Result          Rank
Long-context Understanding                LongBench v2   -               37
Visual Question Answering                 SlideVQA       -               28
Document Understanding                    MMLBD-C        Accuracy 57.3   6
Long-context Document Understanding       MMLB 128K      Accuracy 75.6   6
Long-context Multimodal Understanding     HELMET         Accuracy 65.7   6
Long-document Visual Question Answering   VA             Accuracy 94.6   6
Long-document Visual Question Answering   LCA            Accuracy 93.1   6
Document Understanding                    DUDE           Accuracy 56     6
