InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

About

Vision-Language Models (VLMs) are increasingly tasked with ultra-long multimodal understanding. While linear architectures offer constant computation and memory footprints, they often struggle with high-frequency visual perception compared to standard Transformers. To bridge this gap, we introduce InfiniteVL. We first develop a hybrid base model called InfiniteVL-Base that interleaves a small fraction of Full Attention layers with Gated DeltaNet. Empowered by a tailored distillation and fine-tuning strategy, InfiniteVL-Base matches the fundamental multimodal performance of equivalent Transformers while achieving a 1.7× decoding speedup. However, the quadratic complexity of the retained Full Attention inevitably becomes an efficiency bottleneck when scaling to ultra-long contexts. To break this barrier, we propose a novel Long-Sequence Architectural Fine-Tuning strategy that seamlessly transforms the dense attention into vision-specific sparse mechanisms. This yields two specialized variants: InfiniteVL-Offline for offline retrieval and InfiniteVL-Online for online streaming. By eliminating the computation explosion of global attention without sacrificing high-frequency visual recall, InfiniteVL-Offline achieves Transformer-level length generalization with a 5× prefill acceleration at 256K context. Concurrently, InfiniteVL-Online delivers robust streaming perception with a constant memory footprint and a real-time throughput of 25 FPS. Code and models are available at https://github.com/hustvl/InfiniteVL.
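
To make the bottleneck argument concrete, here is a rough back-of-the-envelope sketch, not the authors' implementation. It builds a hypothetical interleaved layer schedule and compares how token-mixing cost scales with sequence length for the two layer types. The layer count, the one-in-four interleaving ratio, the head dimension, and the cost formulas (sequence length squared times dimension for full attention, sequence length times dimension squared for a linear layer, constants dropped) are all illustrative assumptions rather than values from the paper.

```python
def layer_schedule(num_layers: int, full_attn_every: int) -> list[str]:
    """Hypothetical schedule: one full-attention layer per `full_attn_every` layers."""
    return [
        "full_attention" if (i + 1) % full_attn_every == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]


def mixing_cost(schedule: list[str], seq_len: int, dim: int) -> dict[str, int]:
    """Rough token-mixing cost per layer type (multiply-adds, constants dropped)."""
    # Full attention compares every token with every other token: O(n^2 * d).
    full = sum(seq_len * seq_len * dim for t in schedule if t == "full_attention")
    # A linear layer updates a fixed-size d x d state once per token: O(n * d^2).
    linear = sum(seq_len * dim * dim for t in schedule if t == "gated_deltanet")
    return {"full_attention": full, "gated_deltanet": linear}


def full_attention_share(schedule: list[str], seq_len: int, dim: int) -> float:
    """Fraction of the estimated mixing cost spent in full-attention layers."""
    cost = mixing_cost(schedule, seq_len, dim)
    return cost["full_attention"] / (cost["full_attention"] + cost["gated_deltanet"])


# Assumed toy configuration: 8 layers, every 4th layer is full attention.
schedule = layer_schedule(num_layers=8, full_attn_every=4)
for n in (4_096, 262_144):
    print(n, f"{full_attention_share(schedule, n, dim=128):.4f}")
```

Under these assumptions, the few full-attention layers account for a growing share of the estimated cost as the context lengthens, which is the motivation the abstract gives for replacing dense attention with sparse mechanisms at long sequence lengths.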

Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang • 2025

Related benchmarks

Task	Dataset	Result
Visual Question Answering	ChartQA	Accuracy 82	519
Optical Character Recognition	OCRBench	Score 79.8	433
Multimodal Understanding	MMStar	Accuracy 55.6	407
Visual Question Answering	TextVQA (val)	VQA Score 78.5	365
Visual Question Answering	AI2D	Accuracy 77.2	317
Visual Question Answering	RealworldQA	Accuracy 67.3	259
Multi-discipline Multimodal Understanding	MMMU (val)	Accuracy 44	212
Visual Question Answering	TextVQA	TextVQA Accuracy 78.5	210
Visual Question Answering	DocVQA	Accuracy 91.7	205
Multimodal Reasoning	MMMU (val)	Accuracy 44	168

Showing 10 of 23 rows
