
Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models

About

Conditional image embeddings are feature representations that focus on specific aspects of an image indicated by a given textual condition (e.g., color, genre); producing them remains a challenging problem. Although recent vision foundation models, such as CLIP, offer rich representations of images, they are not designed to focus on a specified condition. In this paper, we propose DIOR, a method that leverages a large vision-language model (LVLM) to generate conditional image embeddings. DIOR is a training-free approach that prompts the LVLM to describe an image with a single word related to a given condition. The hidden state vector of the LVLM's last token is then extracted as the conditional image embedding. DIOR provides a versatile solution that can be applied to any image and condition without additional training or task-specific priors. Comprehensive experimental results on conditional image similarity tasks demonstrate that DIOR outperforms existing training-free baselines, including CLIP. Furthermore, DIOR outperforms methods that require additional training across multiple settings.
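
The extraction step described above (prompt for a single word, then take the last token's hidden state) can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording in `build_prompt`, the helper names, and the toy hidden-state values are all assumptions made for illustration, and the lists stand in for what a real LVLM's final layer would return.

```python
from math import sqrt


def build_prompt(condition: str) -> str:
    """Hypothetical prompt template; the exact wording used by DIOR is not given here."""
    return f"Describe the {condition} of this image in one word:"


def last_token_embedding(hidden_states, attention_mask):
    """Return the hidden state at the last non-padding token position.

    hidden_states: list of per-token vectors (seq_len x dim) from the final layer.
    attention_mask: list of 0/1 flags, 1 for real tokens and 0 for padding.
    """
    last_index = max(i for i, flag in enumerate(attention_mask) if flag == 1)
    return list(hidden_states[last_index])


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for LVLM outputs (assumed values, not real model activations).
states_a = [[0.1, 0.9], [1.0, 0.0], [0.0, 0.0]]  # last position is padding
mask_a = [1, 1, 0]
states_b = [[0.5, 0.5], [0.3, 0.2], [2.0, 0.0]]  # no padding
mask_b = [1, 1, 1]

prompt = build_prompt("color")
emb_a = last_token_embedding(states_a, mask_a)
emb_b = last_token_embedding(states_b, mask_b)
similarity = cosine_similarity(emb_a, emb_b)
```

In this sketch, two images whose last-token vectors point in the same direction get a similarity of 1.0 under the given condition; with a real model, the vectors would come from the LVLM's final-layer hidden states for the prompt plus image.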

Masayuki Kawarada, Kosuke Yamada, Antonio Tejero-de-Pablos, Naoto Inoue • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | CUB-200 (test) | -- | 62 |
| Image Classification | CARS196 (test) | -- | 38 |
| Conditional Image Retrieval | GeneCIS Focus Attribute (test) | Recall@1: 24 | 6 |
| Conditional Image Retrieval | GeneCIS Focus Object (test) | Recall@1: 21.1 | 6 |
| Object Similarity | DomainNet | MAP@1: 0.7255 | 3 |
| Style Similarity | DomainNet | MAP@1: 83.57 | 3 |
| Style Similarity | WikiArt | MAP@1: 45.6 | 3 |
| Conditional Image Embedding | DeepFashion | AMI (Clothing): 0.429 | 2 |
| Conditional Image Embedding | Synthetic Cars | AMI (Mod.): 0.823 | 2 |