
FIX-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text

About

CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of its text encoder, CLIP struggles on downstream tasks with long-text inputs (>77 tokens). To remedy this issue, we propose FIX-CLIP, which includes three novel modules: (1) a dual-branch training pipeline that aligns short and long texts with masked and raw images, respectively, which improves long-text representation while preserving short-text capability; (2) multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction; (3) a hierarchical feature alignment module in the intermediate encoder layers to promote the consistency of multi-scale features. Furthermore, we collect 30M images and use existing MLLMs to synthesize long-text captions for training. Extensive experiments show that FIX-CLIP achieves state-of-the-art performance on both long-text and short-text retrieval benchmarks. For downstream applications, we show that FIX-CLIP's text encoder delivers promising performance in a plug-and-play manner for diffusion models with long-text input. The code is available at https://github.com/bcwang-sjtu/Fix-CLIP.
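The dual-branch idea in (1) can be sketched as two symmetric CLIP-style contrastive losses, one pairing long captions with raw images and one pairing short captions with masked images. This is a minimal numpy sketch of that training objective, not the paper's implementation; the function names and the `alpha` weighting between branches are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere, as CLIP does before similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (image, text) embedding pairs."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature          # pairwise cosine similarities
    labels = np.arange(len(img))                # pair i matches pair i

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()      # diagonal = correct pairs

    # average image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def dual_branch_loss(raw_img_emb, masked_img_emb,
                     long_txt_emb, short_txt_emb, alpha=0.5):
    """Hypothetical combined objective: long captions align with raw images,
    short captions with masked images (alpha is an assumed branch weight)."""
    return (alpha * clip_contrastive_loss(raw_img_emb, long_txt_emb)
            + (1 - alpha) * clip_contrastive_loss(masked_img_emb, short_txt_emb))
```

With matched embeddings the diagonal of the logit matrix dominates and the loss is small; shuffling the text batch breaks the pairing and the loss rises, which is the signal the contrastive objective trains on.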

Bingchao Wang, Zhiwei Ning, Jianyu Ding, Xuanang Gao, Yin Li, Dongsheng Jiang, Jie Yang, Wei Liu• 2025

Related benchmarks

Task | Dataset | Result | Rank
Text-to-Image Retrieval | Flickr30K | R@1 49.6 | 531
Image-to-Text Retrieval | Flickr30K | R@1 60.5 | 429
Text-to-Image Retrieval | MS-COCO | R@1 49.1 | 151
Image-to-Text Retrieval | MS-COCO | R@1 62.3 | 132
Text-to-Image Retrieval | DCI | R@1 66.7 | 79
Image-to-Text Retrieval | DCI | R@1 65.1 | 79
Text-to-Image Retrieval | Urban-1K | -- | 40
Image-to-Text Retrieval | Urban-1K | R@1 86.8 | 36
Image-to-Text Retrieval | Urban-1K | -- | 34
Text-to-Image Retrieval | Urban-1K | R@1 87.7 | 28

(10 of 12 rows shown)
