Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models
About
Conditional image embeddings are feature representations that focus on specific aspects of an image indicated by a given textual condition (e.g., color, genre); producing them remains a challenging problem. Although recent vision foundation models, such as CLIP, offer rich representations of images, they are not designed to focus on a specified condition. In this paper, we propose DIOR, a method that leverages a large vision-language model (LVLM) to generate conditional image embeddings. DIOR is a training-free approach that prompts the LVLM to describe an image with a single word related to a given condition. The hidden state vector of the LVLM's last token is then extracted as the conditional image embedding. DIOR provides a versatile solution that can be applied to any image and condition without additional training or task-specific priors. Comprehensive experimental results on conditional image similarity tasks demonstrate that DIOR outperforms existing training-free baselines, including CLIP. Furthermore, across multiple settings, DIOR outperforms methods that require additional training.
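The extraction step described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the prompt wording in `build_prompt`, the helper names, and the toy hidden states are all assumptions, and a real pipeline would obtain the per-token hidden states from an actual LVLM (which layer to read is not specified in the abstract).

```python
import math
from typing import List, Sequence


def build_prompt(condition: str) -> str:
    # Hypothetical template: the abstract only says the LVLM is asked to
    # describe the image with a single word related to the condition.
    return f"Describe the {condition} of this image in one word:"


def last_token_embedding(
    hidden_states: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    # Take the hidden state at the last non-padding position.
    # Assumes right padding (mask is 1 for real tokens, 0 for padding).
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return list(hidden_states[last])


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Standard cosine similarity, used to compare two conditional embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-in for LVLM output: 4 token positions, hidden size 2,
# with the final position being padding.
toy_hidden = [[0.1, 0.2], [0.3, 0.4], [0.9, 0.1], [0.0, 0.0]]
toy_mask = [1, 1, 1, 0]

prompt = build_prompt("color")
embedding = last_token_embedding(toy_hidden, toy_mask)
```

Two images embedded under the same condition can then be compared with `cosine_similarity`; changing the condition string changes the prompt, and therefore the embedding, without any retraining.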
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | CUB-200 (test) | -- | -- | 62 |
| Image Classification | CARS196 (test) | -- | -- | 38 |
| Conditional Image Retrieval | GeneCIS Focus Attribute (test) | Recall@1 | 24 | 6 |
| Conditional Image Retrieval | GeneCIS Focus Object (test) | Recall@1 | 21.1 | 6 |
| Object Similarity | DomainNet | MAP@1 | 0.7255 | 3 |
| Style Similarity | DomainNet | MAP@1 | 83.57 | 3 |
| Style Similarity | WikiArt | MAP@1 | 45.6 | 3 |
| Conditional Image Embedding | DeepFashion | AMI (Clothing) | 0.429 | 2 |
| Conditional Image Embedding | Synthetic Cars | AMI (Mod.) | 0.823 | 2 |