InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
About
Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when the sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. To achieve competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving its long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
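For readers unfamiliar with the two mechanisms named in the abstract, the following is a minimal sketch and not the InfiniteVL implementation. It assumes the commonly written form of the gated delta rule, S_t = α_t · S_{t-1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ, and a plain causal sliding-window mask; all function names (`sliding_window_mask`, `gated_delta_step`, `read_memory`) are hypothetical and chosen only for illustration. How the actual model interleaves and parameterizes these layers is not specified here.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j:
    causal (j <= i) and within the last `window` positions."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


def gated_delta_step(state, k, v, alpha, beta):
    """One recurrent update of a fixed-size memory (assumed gated delta rule):
        S_t = alpha * S_{t-1} (I - beta * k k^T) + beta * v k^T
    `state` is a d_v x d_k list of lists; its size never grows with sequence length.
    """
    d_k = len(k)
    new_state = []
    for i, row in enumerate(state):
        # What the old memory currently returns for key k (row i of S_{t-1} k).
        pred = sum(row[j] * k[j] for j in range(d_k))
        new_state.append([
            alpha * (row[j] - beta * pred * k[j]) + beta * v[i] * k[j]
            for j in range(d_k)
        ])
    return new_state


def read_memory(state, q):
    """Read out o_t = S_t q_t."""
    return [sum(s * x for s, x in zip(row, q)) for row in state]


if __name__ == "__main__":
    # Local attention: each token sees only itself and the previous token.
    for row in sliding_window_mask(4, 2):
        print(row)

    # Constant-size memory: write a value, then overwrite it under the same key.
    S = [[0.0, 0.0], [0.0, 0.0]]
    S = gated_delta_step(S, k=[1.0, 0.0], v=[3.0, 4.0], alpha=1.0, beta=1.0)
    S = gated_delta_step(S, k=[1.0, 0.0], v=[5.0, 6.0], alpha=1.0, beta=1.0)
    print(read_memory(S, [1.0, 0.0]))
```

The sketch shows why both components keep per-token cost bounded: the mask limits attention to a fixed window, and the recurrent state stays d_v × d_k regardless of how many tokens have been processed.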
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA (val) | VQA Score | 78.5 | 309 |
| Multimodal Understanding | MMStar | Accuracy | 55.6 | 197 |
| Multimodal Reasoning | MMMU (val) | Accuracy | 44 | 114 |
| Optical Character Recognition | OCRBench | OCRBench Score | 79.8 | 83 |
| Multimodal Understanding | SEED-Bench Image | Accuracy | 72.9 | 82 |
| Mathematical Reasoning | MathVista mini | Accuracy | 65.4 | 72 |
| Chart Understanding | ChartQA (test) | Accuracy | 82 | 52 |
| Document Understanding | DocVQA (test) | Accuracy | 91.7 | 39 |
| Real-world QA | RealworldQA | Accuracy | 67.3 | 33 |
| Multimodal Understanding | MME | Score | 2.13e+3 | 22 |