InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

About

Vision-Language Models (VLMs) are increasingly tasked with ultra-long multimodal understanding. While linear architectures offer constant computation and memory footprints, they often struggle with high-frequency visual perception compared to standard Transformers. To bridge this gap, we introduce InfiniteVL. We first develop a hybrid base model called InfiniteVL-Base that interleaves a small fraction of Full Attention layers with Gated DeltaNet. Empowered by a tailored distillation and fine-tuning strategy, InfiniteVL-Base matches the fundamental multimodal performance of equivalent Transformers while achieving a 1.7× decoding speedup. However, the quadratic complexity of the retained Full Attention inevitably becomes an efficiency bottleneck when scaling to ultra-long contexts. To break this barrier, we propose a novel Long-Sequence Architectural Fine-Tuning strategy that seamlessly transforms the dense attention into vision-specific sparse mechanisms. This yields two specialized variants: InfiniteVL-Offline for offline retrieval and InfiniteVL-Online for online streaming. By eliminating the computation explosion of global attention without sacrificing high-frequency visual recall, InfiniteVL-Offline achieves Transformer-level length generalization with a 5× prefill acceleration at 256K context. Concurrently, InfiniteVL-Online delivers robust streaming perception with a constant memory footprint and a real-time throughput of 25 FPS. Code and models are available at https://github.com/hustvl/InfiniteVL.
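
To make the bottleneck described above concrete, the following is a rough accounting sketch, not the authors' implementation. It counts cached elements in a hybrid stack, assuming each full-attention layer keeps a key/value cache that grows with the number of tokens, and each linear-attention layer keeps one fixed-size state matrix per head. The function names, the layer count, the `full_every` interleaving ratio, and the head dimensions are all hypothetical values chosen for illustration; the abstract only says "a small fraction" of layers are Full Attention.

```python
def full_attention_cache_elems(seq_len, num_kv_heads, head_dim):
    """Elements in one full-attention layer's KV cache: one key and one value vector per token."""
    return 2 * seq_len * num_kv_heads * head_dim


def linear_state_elems(num_heads, key_dim, value_dim):
    """Elements in one linear-attention layer's recurrent state.

    Assumes one key_dim x value_dim matrix per head, independent of
    sequence length (auxiliary states such as short convolutions are ignored).
    """
    return num_heads * key_dim * value_dim


def hybrid_cache_elems(seq_len, num_layers, full_every, num_heads, head_dim):
    """Total cached elements when every `full_every`-th layer is full attention (hypothetical layout)."""
    num_full = num_layers // full_every
    num_linear = num_layers - num_full
    return (
        num_full * full_attention_cache_elems(seq_len, num_heads, head_dim)
        + num_linear * linear_state_elems(num_heads, head_dim, head_dim)
    )


if __name__ == "__main__":
    # Hypothetical configuration: 32 layers, 1 in 4 is full attention.
    for seq_len in (4_096, 65_536, 262_144):
        total = hybrid_cache_elems(seq_len, num_layers=32, full_every=4,
                                   num_heads=8, head_dim=64)
        print(f"{seq_len:>8} tokens -> {total:,} cached elements")
```

Under these assumptions the linear layers contribute a fixed cost, so all growth with context length comes from the few retained full-attention layers. That is the part the sparse variants described in the abstract target.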

Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | TextVQA (val) | VQA Score 78.5 | 343 |
| Multimodal Understanding | MMStar | Accuracy 55.6 | 324 |
| Optical Character Recognition | OCRBench | -- | 232 |
| Multimodal Reasoning | MMMU (val) | Accuracy 44 | 144 |
| Multimodal Understanding | SEED-Bench Image | Accuracy 72.9 | 121 |
| Mathematical Reasoning | MathVista mini | Accuracy 65.4 | 102 |
| Chart Understanding | ChartQA (test) | Accuracy 82 | 92 |
| Multimodal Understanding | MME | Score 2.13e+3 | 83 |
| Document Understanding | DocVQA (test) | Accuracy 91.7 | 39 |
| Real-world QA | RealworldQA | Accuracy 67.3 | 33 |
Showing 10 of 13 rows
