Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models

About

Conditional image embeddings are feature representations that focus on specific aspects of an image indicated by a given textual condition (e.g., color, genre). Generating such embeddings has been a challenging problem. Although recent vision foundation models, such as CLIP, offer rich representations of images, they are not designed to focus on a specified condition. In this paper, we propose DIOR, a method that leverages a large vision-language model (LVLM) to generate conditional image embeddings. DIOR is a training-free approach that prompts the LVLM to describe an image with a single word related to a given condition. The hidden state vector of the LVLM's last token is then extracted as the conditional image embedding. DIOR provides a versatile solution that can be applied to any image and condition without additional training or task-specific priors. Comprehensive experimental results on conditional image similarity tasks demonstrate that DIOR outperforms existing training-free baselines, including CLIP. Furthermore, DIOR achieves superior performance compared to methods that require additional training across multiple settings.
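The two steps described in the abstract (prompting for a single condition-related word, then taking the last token's hidden state as the embedding) can be illustrated with a minimal, model-free sketch. Everything here is an assumption made for illustration: the prompt wording in `build_prompt`, the helper names, and the toy vectors standing in for LVLM hidden states. It is not the authors' implementation, and no real LVLM is called.

```python
import math


def build_prompt(condition: str) -> str:
    # Hypothetical wording: the source only says the LVLM is asked to
    # describe the image with a single word related to the condition.
    return f"Describe the {condition} of this image in one word:"


def last_token_embedding(hidden_states, attention_mask):
    """Return the L2-normalized hidden state of the last non-padding token.

    hidden_states: list of per-token vectors (assumed to come from the LVLM).
    attention_mask: list of 0/1 flags, 1 marking real (non-padding) tokens.
    """
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    vec = hidden_states[last]
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("last-token hidden state has zero norm")
    return [x / norm for x in vec]


def cosine(a, b):
    # Inputs are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query, gallery):
    # Indices of gallery embeddings, most similar to the query first.
    scores = [cosine(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy stand-in for LVLM output: three token vectors, the last one is padding.
toy_hidden = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]
query = last_token_embedding(toy_hidden, toy_mask)
gallery = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
ranking = rank_by_similarity(query, gallery)
```

In a real pipeline, `toy_hidden` would be replaced by the hidden states an LVLM produces for the image plus the prompt from `build_prompt`, and the same condition would be used for every image being compared.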

Masayuki Kawarada, Kosuke Yamada, Antonio Tejero-de-Pablos, Naoto Inoue • 2025

Related benchmarks

Task	Dataset	Result
Image Classification	CUB-200 (test)	--	113
Image Classification	CARS196 (test)	--	38
Conditional Image Retrieval	GeneCIS Focus Attribute (test)	Recall@1 24	6
Conditional Image Retrieval	GeneCIS Focus Object (test)	Recall@1 21.1	6
Object Similarity	DomainNet	MAP@1 0.7255	3
Style Similarity	DomainNet	MAP@1 83.57	3
Style Similarity	WikiArt	MAP@1 45.6	3
Conditional Image Embedding	DeepFashion	AMI (Clothing) 0.429	2
Conditional Image Embedding	Synthetic Cars	AMI (Mod.) 0.823	2
