InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
About
Vision-Language Models (VLMs) are increasingly tasked with ultra-long multimodal understanding. While linear architectures offer constant computation and memory footprints, they often struggle with high-frequency visual perception compared to standard Transformers. To bridge this gap, we introduce InfiniteVL. We first develop a hybrid base model, InfiniteVL-Base, that interleaves a small fraction of Full Attention layers with Gated DeltaNet. Empowered by a tailored distillation and fine-tuning strategy, InfiniteVL-Base matches the fundamental multimodal performance of equivalent Transformers while achieving a 1.7× decoding speedup. However, the quadratic complexity of the retained Full Attention inevitably becomes an efficiency bottleneck when scaling to ultra-long contexts. To break this barrier, we propose a novel Long-Sequence Architectural Fine-Tuning strategy that seamlessly transforms the dense attention into vision-specific sparse mechanisms. This yields two specialized variants: InfiniteVL-Offline for offline retrieval and InfiniteVL-Online for online streaming. By eliminating the computation explosion of global attention without sacrificing high-frequency visual recall, InfiniteVL-Offline achieves Transformer-level length generalization with a 5× prefill acceleration at 256K context. Concurrently, InfiniteVL-Online delivers robust streaming perception with a constant memory footprint and a real-time throughput of 25 FPS. Code and models are available at https://github.com/hustvl/InfiniteVL.
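To make the bottleneck argument concrete, the following is a minimal back-of-the-envelope sketch, not the authors' implementation. It builds a hypothetical interleaved layer schedule and compares a rough token-mixing cost for full attention (quadratic in sequence length) against a linear-attention layer (linear in sequence length). The function names, the one-in-four interleaving ratio, and the dimensions `d_model=2048` and `d_head=128` are all assumptions chosen for illustration; the abstract only states that "a small fraction" of layers use Full Attention.

```python
from typing import Dict, List


def layer_schedule(num_layers: int, full_every: int) -> List[str]:
    """Hypothetical schedule: every `full_every`-th layer is full attention,
    the rest are linear-attention layers (standing in for Gated DeltaNet)."""
    return [
        "full" if (i + 1) % full_every == 0 else "linear"
        for i in range(num_layers)
    ]


def prefill_cost(schedule: List[str], seq_len: int,
                 d_model: int, d_head: int) -> Dict[str, int]:
    """Rough token-mixing cost in multiply-accumulates, constants ignored.

    Assumed cost model (illustrative only):
      full attention   ~ seq_len^2 * d_model
      linear attention ~ seq_len * d_model * d_head  (fixed-size state update)
    """
    cost = {"full": 0, "linear": 0}
    for kind in schedule:
        if kind == "full":
            cost["full"] += seq_len * seq_len * d_model
        else:
            cost["linear"] += seq_len * d_model * d_head
    return cost


def full_attention_share(schedule: List[str], seq_len: int,
                         d_model: int = 2048, d_head: int = 128) -> float:
    """Fraction of the estimated mixing cost spent in full-attention layers."""
    cost = prefill_cost(schedule, seq_len, d_model, d_head)
    return cost["full"] / (cost["full"] + cost["linear"])


if __name__ == "__main__":
    schedule = layer_schedule(num_layers=8, full_every=4)  # assumed ratio
    for seq_len in (1_024, 32_768, 262_144):
        share = full_attention_share(schedule, seq_len)
        print(f"seq_len={seq_len:>7}: full-attention share = {share:.4f}")
```

Under this simplified model, the few full-attention layers account for a growing share of the cost as the sequence lengthens, which is the motivation the abstract gives for replacing dense attention with sparse mechanisms at long context.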
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | ChartQA | Accuracy | 82 | 519 |
| Optical Character Recognition | OCRBench | Score | 79.8 | 433 |
| Multimodal Understanding | MMStar | Accuracy | 55.6 | 407 |
| Visual Question Answering | TextVQA (val) | VQA Score | 78.5 | 365 |
| Visual Question Answering | AI2D | Accuracy | 77.2 | 317 |
| Visual Question Answering | RealworldQA | Accuracy | 67.3 | 259 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy | 44 | 212 |
| Visual Question Answering | TextVQA | TextVQA Accuracy | 78.5 | 210 |
| Visual Question Answering | DocVQA | Accuracy | 91.7 | 205 |
| Multimodal Reasoning | MMMU (val) | Accuracy | 44 | 168 |