FG-CLIP: Fine-Grained Visual and Textual Alignment

About

Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks such as image-text retrieval and zero-shot classification but struggles with fine-grained understanding due to its focus on coarse-grained short captions. To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances fine-grained understanding through three key innovations. First, we leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. Second, a high-quality dataset is constructed with 12 million images and 40 million region-specific bounding boxes aligned with detailed captions to ensure precise, context-rich representations. Third, 10 million hard fine-grained negative samples are incorporated to improve the model's ability to distinguish subtle semantic differences. We construct a comprehensive dataset, termed FineHARD, by integrating high-quality region-specific annotations with hard fine-grained negative samples. Corresponding training methods are meticulously designed for these data. Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP's effectiveness in capturing fine-grained image details and improving overall model performance. The data, code, and models are available at https://github.com/360CVGroup/FG-CLIP.
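The abstract says that hard fine-grained negative samples help the model distinguish subtle semantic differences, but it does not spell out the training objective. The following is a minimal sketch of one common way such negatives can enter a contrastive loss, not FG-CLIP's actual implementation: a softmax cross-entropy over one positive caption and several hard negative captions for a single image or region embedding. The function names, the toy embeddings, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def hard_negative_loss(
    visual: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not taken from the source
) -> float:
    """Cross-entropy of picking the positive caption among hard negatives.

    Hypothetical helper: `visual` is an image or region embedding,
    `positive` is the matching caption embedding, and `negatives` are
    embeddings of captions that differ from the positive in small details.
    """
    # Logit 0 belongs to the positive; the rest belong to the hard negatives.
    logits = [cosine(visual, positive) / temperature]
    logits += [cosine(visual, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Toy 2-D embeddings: the loss is small when the positive caption is the
# closest one, and large when a hard negative is closer to the visual feature.
aligned = hard_negative_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
confused = hard_negative_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned < confused)
```

Because the negatives are scored against the same visual embedding as the positive, the loss only drops when the model separates captions that differ in fine details, which is the behavior the abstract attributes to hard negatives.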

Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, Yuhui Yin • 2025

Related benchmarks

Task                     Dataset              Result          Rank
Text-to-Image Retrieval  Flickr30K            R@1 76.4        460
Text-to-Image Retrieval  Flickr30K 1K (test)  R@1 81.3        375
Text-to-Image Retrieval  MSCOCO 5K (test)     R@1 50.46       286
Image-to-Text Retrieval  DCI                  R@1 61.8        68
Text-to-Image Retrieval  DCI                  R@1 60.6        68
Image Classification     Covidx               Accuracy 36.3   57
Text-to-Image Retrieval  MSCOCO (5K)          R@1 45.4        42
Image-to-Text Retrieval  Urban-1K             R@1 93          34
Text-to-Image Retrieval  Urban-1K             R@1 89.9        34
Text-to-Image Retrieval  SV-1k                R@1 94.9        33
Showing 10 of 31 rows
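
Most results above are reported as R@1 (Recall@1): the percentage of queries whose correct match is ranked first. As a rough sketch of how such a number can be computed from a query-by-candidate similarity matrix, the snippet below assumes exactly one correct candidate per query, located at the same index as the query. That is a simplification; some benchmarks pair each image with several captions, and the evaluation protocol behind the figures in this table is not described here. The function name and the toy matrix are illustrative.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose ground-truth candidate ranks in the top k.

    Assumes similarity[i][j] scores query i against candidate j and that
    the correct candidate for query i sits at index i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy example: queries 0 and 1 rank their match first; query 2 ranks it second.
toy = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.2, 0.7, 0.6],
]
print(round(100 * recall_at_k(toy, k=1), 1))  # 66.7
print(round(100 * recall_at_k(toy, k=2), 1))  # 100.0
```

Transposing the matrix swaps the retrieval direction, so the same function covers both text-to-image and image-to-text retrieval under the one-match-per-query assumption.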
