
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

About

In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in $\textbf{2B}$ and $\textbf{8B}$ parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of $\textbf{77.8}$ on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
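The two ideas in the abstract, Matryoshka-style embedding truncation and a retrieve-then-rerank pipeline, can be sketched in a few lines. The snippet below is a minimal illustration only: `embed` and `rerank_scores` are hypothetical placeholders standing in for the Qwen3-VL-Embedding bi-encoder and Qwen3-VL-Reranker cross-encoder, not their actual APIs. What it does show faithfully is the mechanics: Matryoshka truncation keeps the first `k` dimensions and renormalizes, dense retrieval ranks unit-norm vectors by dot product, and a cross-encoder rescores only the top candidates.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical stand-ins for model outputs ---------------------------------
# In a real pipeline these would call Qwen3-VL-Embedding / Qwen3-VL-Reranker.
def embed(texts, dim=256):
    """Placeholder embedder returning L2-normalized vectors."""
    vecs = rng.normal(size=(len(texts), dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def rerank_scores(query, docs):
    """Placeholder for cross-encoder relevance scores on (query, doc) pairs."""
    return rng.uniform(size=len(docs))

# --- Matryoshka truncation ----------------------------------------------------
def truncate_mrl(vecs, k):
    """Keep the first k dimensions and renormalize, so cosine similarity
    stays well-defined at the reduced embedding size."""
    out = vecs[:, :k]
    return out / np.linalg.norm(out, axis=1, keepdims=True)

docs = ["a red bicycle", "a cat on a sofa",
        "city skyline at night", "a bowl of ramen"]

# Stage 1: dense retrieval at a reduced dimension (here k=64)
doc_vecs = truncate_mrl(embed(docs), k=64)
query_vec = truncate_mrl(embed(["photo of a cat"]), k=64)[0]
sims = doc_vecs @ query_vec          # unit vectors: dot product == cosine
top_k = np.argsort(-sims)[:2]        # shortlist of candidates

# Stage 2: fine-grained reranking of the shortlist with the cross-encoder
scores = rerank_scores("photo of a cat", [docs[i] for i in top_k])
best = top_k[int(np.argmax(scores))]
print(docs[best])
```

The division of labor matches the paper's framing: the bi-encoder makes the whole corpus searchable with a single dot product per document, while the more expensive cross-attention reranker is applied only to the small retrieved shortlist.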

Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | - | - | 935 |
| Text-to-Image Retrieval | Flickr30K | R@1 | 81.9 | 460 |
| Image-to-Text Retrieval | Flickr30K | R@1 | 92.9 | 379 |
| Science Question Answering | ScienceQA | Accuracy | 92.5 | 229 |
| Multimodal Understanding | SEED-Bench | Accuracy | 78.3 | 203 |
| Diagram Understanding | AI2D | Accuracy | 82.2 | 167 |
| Text-to-Image Retrieval | COCO | Recall@1 | 55.2 | 130 |
| Image-to-Text Retrieval | COCO | R@1 | 69.6 | 123 |
| Text-to-Image Retrieval | Flickr30K-CN | R@1 | 78.6 | 99 |
| Image-to-Text Retrieval | Flickr30K-CN | R@1 | 91.9 | 99 |

Showing 10 of 46 rows.

Other info

GitHub
