Qwen3-VL Technical Report
About
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs.

Qwen3-VL rests on three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision).

Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
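To make the interleaved-MRoPE idea concrete, below is a minimal illustrative sketch, not the Qwen3-VL implementation: each vision token receives a (t, h, w) coordinate, and rotary frequency pairs are assigned to the three axes in round-robin (interleaved) order rather than as three contiguous blocks of the head dimension. The function names and shapes here are hypothetical choices for illustration only.

```python
import numpy as np

def vision_position_ids(num_frames, height, width):
    """Hypothetical sketch: give every video patch token a (t, h, w)
    coordinate, flattened in frame-major order.
    Returns an array of shape (num_frames * height * width, 3)."""
    t, h, w = np.meshgrid(
        np.arange(num_frames), np.arange(height), np.arange(width),
        indexing="ij",
    )
    return np.stack([t, h, w], axis=-1).reshape(-1, 3)

def interleaved_axis_assignment(head_dim):
    """Hypothetical sketch: map each rotary frequency pair to an axis
    in interleaved order t, h, w, t, h, w, ... instead of splitting
    the head dimension into three contiguous per-axis blocks."""
    num_pairs = head_dim // 2
    return np.arange(num_pairs) % 3  # 0 = temporal, 1 = height, 2 = width

pos = vision_position_ids(2, 3, 4)       # 24 patch tokens for a 2x3x4 grid
axes = interleaved_axis_assignment(64)   # per-pair axis labels
```

The intuition behind interleaving is that every axis then touches both low- and high-frequency rotary channels, rather than one axis monopolizing a frequency band.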
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 90.1 | 1455 |
| Visual Question Answering | VQA v2 | Accuracy | 74.1 | 1362 |
| Mathematical Reasoning | MATH | Accuracy | 89.24 | 882 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 82.1 | 807 |
| Multimodal Evaluation | MME | Score | 2000 | 658 |
| Multimodal Understanding | MMBench | Accuracy | 90.6 | 637 |
| Instruction Following | IFEval | Accuracy | 88.2 | 625 |
| Visual Question Answering | GQA | Accuracy | 71.9 | 505 |
| Science Question Answering | ScienceQA | Accuracy | 94.94 | 502 |
| Multimodal Reasoning | MM-Vet | Score | 79.4 | 431 |