
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

About

Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. To achieve competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving its long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
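To make the two ingredients named in the abstract more concrete, the following is a minimal pure-Python sketch, not the InfiniteVL implementation. It assumes the commonly cited form of the gated delta rule, S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T, and a plain causal sliding-window mask. The function names (`sliding_window_mask`, `gated_delta_step`, `read_state`) and the scalar gates are hypothetical choices made for illustration.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Causal, and restricted to the most recent `window` positions,
    so per-token attention cost is bounded by `window`.
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent update of a fixed-size (d_v x d_k) memory matrix S.

    Assumed update: S_new = alpha * S (I - beta k k^T) + beta v k^T.
    alpha in [0, 1] decays old memory; beta in [0, 1] is the write strength.
    """
    d_v, d_k = len(S), len(k)
    # S k: the value the current memory associates with key k.
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    return [[alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
             for j in range(d_k)]
            for i in range(d_v)]


def read_state(S, q):
    """Read from memory with query q: o = S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


if __name__ == "__main__":
    mask = sliding_window_mask(seq_len=4, window=2)
    print(mask[3])  # the last query sees only positions 2 and 3

    S = [[0.0, 0.0], [0.0, 0.0]]          # memory size never changes
    key = [1.0, 0.0]
    S = gated_delta_step(S, key, [2.0, 3.0], alpha=1.0, beta=1.0)
    S = gated_delta_step(S, key, [5.0, 7.0], alpha=1.0, beta=1.0)
    print(read_state(S, key))             # value stored under `key` was overwritten
```

In this sketch the recurrent state keeps the same shape no matter how many tokens are processed, and the window mask caps how many past keys each query touches, which is the intuition behind the constant latency and memory footprint claimed in the abstract.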

Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | TextVQA (val) | VQA Score 78.5 | 309 |
| Multimodal Understanding | MMStar | Accuracy 55.6 | 197 |
| Multimodal Reasoning | MMMU (val) | Accuracy 44 | 114 |
| Optical Character Recognition | OCRBench | OCRBench Score 79.8 | 83 |
| Multimodal Understanding | SEED-Bench Image | Accuracy 72.9 | 82 |
| Mathematical Reasoning | MathVista mini | Accuracy 65.4 | 72 |
| Chart Understanding | ChartQA (test) | Accuracy 82 | 52 |
| Document Understanding | DocVQA (test) | Accuracy 91.7 | 39 |
| Real-world QA | RealworldQA | Accuracy 67.3 | 33 |
| Multimodal Understanding | MME | Score 2.13e+3 | 22 |
Showing 10 of 13 rows
