SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

About

We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai • 2025

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 45.4	3089
Visual Question Answering	TextVQA	Accuracy 74.3	1455
Visual Question Answering	GQA	Accuracy 65.2	1445
Image Classification	ImageNet-1K	Top-1 Acc 88	1239
Semantic segmentation	ADE20K	mIoU 51.6	1028
Image Classification	CIFAR-10	--	973
Image Classification	ImageNet 1k (test)	Top-1 Accuracy 76.79	939
Multimodal Evaluation	MME	Score 1.73e+3	902
Image Classification	ImageNet V2	Top-1 Acc 79.8	767
Image Classification	ImageNet A	Top-1 Acc 90.5	723

Showing 10 of 567 rows

