Qwen3-VL Technical Report
About
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy89.7 | 935 | |
| Mathematical Reasoning | MATH | Accuracy89.24 | 643 | |
| Multimodal Evaluation | MME | Score2.00e+3 | 557 | |
| Text-based Visual Question Answering | TextVQA | Accuracy80.34 | 496 | |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy82.9 | 345 | |
| Referring Expression Comprehension | RefCOCO (val) | -- | 335 | |
| Referring Expression Comprehension | RefCOCO (testA) | -- | 333 | |
| Mathematical Reasoning | MathVista | Score61.3 | 322 | |
| OCR Evaluation | OCRBench | Score863 | 296 | |
| Visual Question Answering | OK-VQA (test) | Accuracy44.4 | 296 |