
Global Context Compression with Interleaved Vision-Text Transformation

About

Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This has motivated earlier works that render the Transformer's input text into images for prefilling, which effectively reduces the number of tokens through visual encoding, thereby alleviating the quadratic growth of attention computation. However, this partial compression fails to save computational or memory costs during token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both the prefilling and inference stages. We propose VIST2, a novel Transformer that interleaves input text chunks with their visual encodings, while depending exclusively on the visual tokens in the pre-context to predict the next-token distribution. Building on this idea, we render text chunks into sketch images and train VIST2 in multiple stages, starting with curriculum-scheduled pretraining for optical language modeling, followed by modal-interleaved instruction tuning. We conduct extensive experiments with VIST2 model families scaled from 0.6B to 8B parameters to explore the training recipe and hyperparameters. With a 4$\times$ compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks, achieving, on average, a 3$\times$ speedup in first-token generation, a 77% reduction in memory usage, and a 74% reduction in FLOPs. Our code and datasets will be released publicly to support further studies.
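The decoding savings claimed above follow from simple arithmetic: with global compression, the KV cache holds roughly 1/4 as many (visual) tokens, and per-step attention cost scales linearly with cache length. A minimal illustrative sketch of this back-of-envelope calculation (not the paper's code; the function and numbers are hypothetical):

```python
# Back-of-envelope estimate: how a 4x visual compression of the pre-context
# shrinks KV-cache memory and per-step attention FLOPs during decoding.
# This is illustrative arithmetic only, not the VIST2 implementation.

def decode_cost(context_tokens: int, compression_ratio: float = 1.0):
    """Return (cached tokens, relative per-step attention FLOPs).

    With global compression, the cache holds context_tokens / compression_ratio
    visual tokens instead of context_tokens text tokens; per-step attention
    cost scales linearly with the cache length.
    """
    cached = context_tokens / compression_ratio
    return cached, cached

baseline_kv, baseline_flops = decode_cost(8192)
compressed_kv, compressed_flops = decode_cost(8192, compression_ratio=4.0)

print(f"KV-cache reduction: {1 - compressed_kv / baseline_kv:.0%}")
print(f"Attention FLOPs reduction: {1 - compressed_flops / baseline_flops:.0%}")
```

Both ratios come out to 75% under this idealized model; the paper's reported 77% memory and 74% FLOPs reductions are close but differ because uncompressed tokens (e.g., the current chunk and generated text) still occupy part of the cache.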

Dian Jiao, Jiaxin Duan, Shuai Zhao, Jiabing Leng, Yiran Zhang, Feng Huang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | MATH | Accuracy | 30.15 | 643 |
| Mathematical Reasoning | GSM8K | Accuracy | 87.15 | 212 |
| Long-context Understanding | LongBench (test) | SingleDoc Performance | 45.2 | 30 |
| Language Understanding | CMMLU | Accuracy | 75.12 | 27 |
| Algebraic Reasoning | AQUA | Accuracy | 32.1 | 15 |
| Question Answering | LooGLE Long Dependency QA | BLEU-1 | 0.0942 | 12 |
| Summarization | LooGLE ArXiv Paper Summarization | BLEU-1 | 29.15 | 11 |
| Image Captioning | COCO (test) | ROUGE-1 | 0.518 | 9 |
| Optical Character Recognition | WuDao rendered text images (test) | ROUGE (R=2) | 0.981 | 9 |
