LoTLIP: Improving Language-Image Pre-training for Long Text Understanding

About

Understanding long text is in great demand in practice but remains beyond the reach of most language-image pre-training (LIP) models. In this work, we empirically confirm that the key cause is that training images are usually paired with short captions, leaving certain tokens easily overshadowed by salient tokens. To address this problem, our initial attempt is to relabel the data with long captions; however, learning directly from these captions may degrade performance in understanding short text (e.g., in the image classification task). Then, after incorporating corner tokens to aggregate diverse textual information, we manage to help the model catch up to its original level of short-text understanding while greatly enhancing its capability for long-text understanding. We further look into whether the model can continuously benefit from longer captions and notice a clear trade-off between performance and efficiency. Finally, we validate the effectiveness of our approach using a self-constructed large-scale dataset, which consists of 100M long-caption-oriented text-image pairs. Our method demonstrates superior performance in long-text-image retrieval tasks. The project page is available at https://wuw2019.github.io/lot-lip.

Wei Wu, Kecheng Zheng, Shuailei Ma, Fan Lu, Yuxin Guo, Yifei Zhang, Wei Chen, Qingpei Guo, Yujun Shen, Zheng-Jun Zha • 2024

Related benchmarks

Task | Dataset | Result | Rank
Image-to-Text Retrieval | DCI | R@1 62.1 | 79
Text-to-Image Retrieval | DCI | R@1 61 | 79
Text-to-Image Retrieval | SV-1k | R@1 86.8 | 33
Image-to-Text Retrieval | SV-1k | R@1 95.5 | 23
Image-Text Retrieval | DCI long-text | -- | 22
Image-Text Retrieval | MSCOCO (val) | T2I Recall@1 38.1 | 16
Image-Text Retrieval | Flickr30k (val) | Text-to-Image Recall@1 65.2 | 16
Image-to-Text Retrieval | ShareGPT4V 1k | R@1 95.5 | 11
Text-to-Image Retrieval | ShareGPT4V 1k | Recall@1 86.8 | 11
Image-to-Text Retrieval | SV-10k | R@1 81.4 | 10
Showing 10 of 14 rows
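
The results above are reported as R@1 (Recall@1). As a rough illustration of how such a retrieval score is commonly computed, here is a minimal sketch. It is not the paper's evaluation code. The function name recall_at_k, the toy score matrix, and the assumption that query i matches exactly candidate i are all hypothetical; real benchmarks may pair one image with several captions.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose matching candidate is ranked in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption (illustrative only): the correct candidate for query i
    is candidate i, i.e. a one-to-one pairing.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy image-by-text score matrix (made-up numbers, not from the paper).
scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
]

# Image-to-text: each row (image) ranks the texts.
i2t_r1 = recall_at_k(scores, k=1)

# Text-to-image: transpose so each row (text) ranks the images.
t2i_r1 = recall_at_k([list(col) for col in zip(*scores)], k=1)
```

Multiplying the returned fraction by 100 gives a percentage on the same scale as the table entries.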
