
FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model

About

Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.

Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng, Yuhui Yin • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Retrieval | Flickr30K | R@1 | 85 | 460 |
| Image-to-Text Retrieval | Flickr30K | R@1 | 95.9 | 379 |
| Text-to-Image Retrieval | COCO | R@1 | 56.7 | 130 |
| Image-to-Text Retrieval | COCO | R@1 | 74.6 | 123 |
| Image-to-Text Retrieval | Flickr30K-CN | R@1 | 91.5 | 99 |
| Text-to-Image Retrieval | Flickr30K-CN | R@1 | 77.2 | 99 |
| Image-to-Text Retrieval | DCI | R@1 | 70.6 | 68 |
| Text-to-Image Retrieval | DCI | R@1 | 72.1 | 68 |
| Text-to-Image Retrieval | COCO-CN | R@1 | 68.1 | 49 |
| Image-to-Text Retrieval | COCO-CN | R@1 | 83.2 | 48 |
Showing 10 of 18 rows
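
The results above are reported as R@1 (Recall@1). This page does not describe the evaluation protocol, so the following is only a minimal sketch of the commonly used definition: the percentage of queries whose top-k ranked candidates include at least one correct match. The function name `recall_at_k`, the similarity matrix, and the ground-truth sets are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence, Set


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    ground_truth: Sequence[Set[int]],
    k: int = 1,
) -> float:
    """Return the percentage of queries with a correct match in the top k.

    similarity[i][j] is the score between query i and candidate j.
    ground_truth[i] is the set of candidate indices that are correct for query i.
    """
    if not similarity or len(similarity) != len(ground_truth):
        raise ValueError("need one non-empty score row per ground-truth set")

    hits = 0
    for scores, relevant in zip(similarity, ground_truth):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        # A query counts as a hit if any correct candidate is in the top k.
        if relevant.intersection(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data (hypothetical): 3 text queries scored against 3 images.
sim = [
    [0.9, 0.1, 0.2],  # query 0: image 0 ranked first (correct)
    [0.3, 0.5, 0.8],  # query 1: image 2 ranked first, but image 1 is correct
    [0.1, 0.4, 0.7],  # query 2: image 2 ranked first (correct)
]
truth = [{0}, {1}, {2}]

print(f"R@1 = {recall_at_k(sim, truth, k=1):.1f}")  # R@1 = 66.7
print(f"R@2 = {recall_at_k(sim, truth, k=2):.1f}")  # R@2 = 100.0
```

For image-to-text retrieval the same function applies with the roles swapped: each row holds one image's scores against all captions, and each ground-truth set may contain several caption indices, since datasets such as Flickr30K and COCO pair each image with multiple captions.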
