DetailCLIP: Injecting Image Details into CLIP's Feature Space
About
Although CLIP-like Visual Language Models provide a functional joint feature space for image and text, the limited image input size of CLIP-like models (e.g., 224 pixels) means that subtle details are lost in the feature representation when the input is a high-resolution image (e.g., 2240 pixels). Our proposed framework addresses this issue by generating a single feature representation for a high-resolution image that retains image details from different scales while sharing the same semantic space as the original CLIP. One application scenario is remote sensing text-image retrieval, where targets (e.g., vehicles and ships) often appear at tiny scales. To achieve this, we develop a feature fusion model that relies on CLIP features extracted from a carefully designed image patch method, dubbed Complete Cover. This method ensures comprehensive coverage of objects across various scales, and the fusion model is weakly supervised by image-agnostic class prompted queries. We evaluate our framework's performance using real-world and synthetic datasets, demonstrating significant improvements in image retrieval tasks based on class prompted queries. To further showcase our framework's capability in detail retrieval, we introduce a CLEVR-like synthetic dataset, named CLEVR-DS. This fully annotated dataset offers a controllable object scale, allowing for a more thorough examination of our approach's effectiveness. Our code is publicly available at https://github.com/zilunzhang/DetailCLIP
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image-Text Retrieval | COCO | Recall@1: 62.63 | 27 |
| Image-Text Retrieval | CLEVR-DS | Recall@1: 33.46 | 12 |
| Image-Text Retrieval | Unity | Recall@1: 55.21 | 12 |
| Image-Text Retrieval | LVIS | Recall@1: 15.29 | 12 |
| Text-to-Image Retrieval | CLEVR-DS-S (test) | Recall@1: 14.66 | 3 |
| Text-to-Image Retrieval | CLEVR-DS-L (test) | Recall@1: 16.33 | 3 |
| Text-to-Image Retrieval | CLEVR-DS (test) | Recall@1: 22.54 | 3 |
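
For readers unfamiliar with the metric in the table, here is a minimal sketch of how Recall@K is commonly computed for text-to-image retrieval. It assumes the hit-rate definition (a query counts as a hit if any relevant image appears in its top K results); the evaluation protocol behind the numbers above may differ, and the function name `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(similarity, relevant, k=1):
    """Percentage of queries with at least one relevant image in the top k.

    similarity[q][i] is the score of image i for text query q;
    relevant[q] is the set of image indices that are correct for query q.
    """
    hits = 0
    for scores, rel in zip(similarity, relevant):
        # Rank image indices by descending score and keep the top k.
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        if rel.intersection(top_k):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 3 text queries scored against 3 images.
similarity = [
    [0.9, 0.1, 0.3],  # query 0: best match is image 0 (relevant)
    [0.2, 0.8, 0.5],  # query 1: best match is image 1 (relevant one is image 2)
    [0.4, 0.6, 0.1],  # query 2: best match is image 1 (relevant)
]
relevant = [{0}, {2}, {1}]

print(round(recall_at_k(similarity, relevant, k=1), 2))  # 66.67
print(recall_at_k(similarity, relevant, k=2))            # 100.0
```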