
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

About

Vision-Language Models (VLMs) are increasingly tasked with ultra-long multimodal understanding. While linear architectures offer constant computation and memory footprints, they often struggle with high-frequency visual perception compared to standard Transformers. To bridge this gap, we introduce InfiniteVL. We first develop a hybrid base model called InfiniteVL-Base that interleaves a small fraction of Full Attention layers with Gated DeltaNet. Empowered by a tailored distillation and fine-tuning strategy, InfiniteVL-Base matches the fundamental multimodal performance of equivalent Transformers while achieving a 1.7× decoding speedup. However, the quadratic complexity of the retained Full Attention inevitably becomes an efficiency bottleneck when scaling to ultra-long contexts. To break this barrier, we propose a novel Long-Sequence Architectural Fine-Tuning strategy that seamlessly transforms the dense attention into vision-specific sparse mechanisms. This yields two specialized variants: InfiniteVL-Offline for offline retrieval and InfiniteVL-Online for online streaming. By eliminating the computation explosion of global attention without sacrificing high-frequency visual recall, InfiniteVL-Offline achieves Transformer-level length generalization with a 5× prefill acceleration at 256K context. Concurrently, InfiniteVL-Online delivers robust streaming perception with a constant memory footprint and a real-time throughput of 25 FPS. Code and models are available at https://github.com/hustvl/InfiniteVL.
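
The memory argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. It is not the InfiniteVL implementation: the layer count, the interleaving ratio, the head sizes, and the helper names `hybrid_layout` and `cache_elements` are all hypothetical, and a linear layer is simply modeled as holding one fixed-size state matrix per head. The point is only that the cache of the few full-attention layers grows with sequence length while the linear layers stay constant.

```python
# Illustrative only: every number and name here is an assumption,
# not a value or API taken from InfiniteVL.

def hybrid_layout(num_layers: int, full_every: int) -> list[str]:
    """Label each layer; every `full_every`-th layer is full attention."""
    return [
        "full_attention" if (i + 1) % full_every == 0 else "linear"
        for i in range(num_layers)
    ]


def cache_elements(layout: list[str], seq_len: int, num_heads: int, head_dim: int) -> int:
    """Count cached values for one sequence under a simplified model.

    Full attention keeps a key and a value vector for every token,
    so its cache grows with seq_len. A linear layer is modeled as one
    head_dim x head_dim state per head, independent of seq_len.
    """
    kv_per_full_layer = 2 * seq_len * num_heads * head_dim
    state_per_linear_layer = num_heads * head_dim * head_dim
    total = 0
    for kind in layout:
        if kind == "full_attention":
            total += kv_per_full_layer
        else:
            total += state_per_linear_layer
    return total


if __name__ == "__main__":
    layout = hybrid_layout(num_layers=8, full_every=4)  # hypothetical ratio
    for n in (1_000, 10_000, 100_000):
        print(n, cache_elements(layout, n, num_heads=4, head_dim=64))
```

Under this simplified model, the full-attention layers dominate the total at long contexts, which is consistent with the abstract's motivation for replacing dense attention with sparse mechanisms in the long-sequence variants.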

Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | ChartQA | Accuracy | 82 | 519 |
| Optical Character Recognition | OCRBench | Score | 79.8 | 433 |
| Multimodal Understanding | MMStar | Accuracy | 55.6 | 407 |
| Visual Question Answering | TextVQA (val) | VQA Score | 78.5 | 365 |
| Visual Question Answering | AI2D | Accuracy | 77.2 | 317 |
| Visual Question Answering | RealworldQA | Accuracy | 67.3 | 259 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy | 44 | 212 |
| Visual Question Answering | TextVQA | TextVQA Accuracy | 78.5 | 210 |
| Visual Question Answering | DocVQA | Accuracy | 91.7 | 205 |
| Multimodal Reasoning | MMMU (val) | Accuracy | 44 | 168 |
Showing 10 of 23 rows
