Language-driven Fine-grained Retrieval

About

Existing fine-grained image retrieval (FGIR) methods learn discriminative embeddings by adopting semantically sparse one-hot labels derived from category names as supervision. While effective on seen classes, such supervision overlooks the rich semantics encoded in category names, hindering the modeling of comparability among cross-category details and, in turn, limiting generalization to unseen categories. To tackle this, we introduce LaFG, a Language-driven framework for Fine-Grained Retrieval that converts class names into attribute-level supervision using large language models (LLMs) and vision-language models (VLMs). Treating each name as a semantic anchor, LaFG prompts an LLM to generate detailed, attribute-oriented descriptions. To mitigate attribute omission in these descriptions, it leverages a frozen VLM to project them into a vision-aligned space, clustering them into a dataset-wide attribute vocabulary while harvesting complementary attributes from related categories. Leveraging this vocabulary, a global prompt template selects category-relevant attributes, which are aggregated into category-specific linguistic prototypes. These prototypes supervise the retrieval model to steer
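The pipeline described in the abstract (embed the descriptions, cluster them into an attribute vocabulary, select category-relevant attributes, aggregate them into a prototype) can be illustrated with a minimal sketch. It is not the authors' implementation. Plain Python lists stand in for the frozen VLM's text embeddings, a greedy cosine-threshold grouping stands in for whatever clustering LaFG actually uses, and top-k cosine selection stands in for the global prompt template. The function names and the `threshold` and `top_k` parameters are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_vocabulary(description_embeddings, threshold=0.9):
    """Group description embeddings into attribute clusters.

    Assumption: a greedy pass that adds each embedding to the first cluster
    whose centroid is at least `threshold` similar, else opens a new cluster.
    Returns one unit-length centroid per cluster (the "attribute vocabulary").
    """
    clusters = []
    for emb in description_embeddings:
        for members in clusters:
            if cosine(mean(members), emb) >= threshold:
                members.append(emb)
                break
        else:
            clusters.append([emb])
    return [normalize(mean(members)) for members in clusters]


def category_prototype(category_embedding, vocabulary, top_k=2):
    """Select the top_k attributes most similar to the category embedding
    and average them into a unit-length linguistic prototype (assumed rule)."""
    ranked = sorted(vocabulary,
                    key=lambda attr: cosine(category_embedding, attr),
                    reverse=True)
    return normalize(mean(ranked[:top_k]))


# Toy stand-ins for VLM embeddings of LLM-generated attribute descriptions.
descriptions = [[1.0, 0.0, 0.0], [0.99, 0.05, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
vocabulary = build_vocabulary(descriptions)
prototype = category_prototype([1.0, 1.0, 0.0], vocabulary, top_k=2)
```

In this toy run the first two descriptions merge into one attribute, so the vocabulary has three entries, and the prototype is built only from the two attributes that overlap with the category embedding.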

Shijie Wang, Xin Yu, Yadan Luo, Zijian Wang, Pengfei Zhang, Zi Huang • 2025

Related benchmarks

Task	Dataset	Metric	Result
Image Retrieval	CUB-200-2011 (test)	Recall@1	87.2	251
Image Retrieval	Stanford Online Products (test)	Recall@1	87.1	231
Image Retrieval	CUB-200 2011	Recall@1	87.2	163
Image Retrieval	Stanford Online Products	Recall@1	87.1	64
Image Retrieval	Stanford Cars 196	Recall@1	91.5	17
Image Retrieval	Stanford Cars 196 (test)	Recall@1	91.5	16
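
For reference, Recall@1 as commonly defined in retrieval benchmarks is the percentage of queries whose top-ranked gallery item shares the query's class. The sketch below assumes that definition and is not taken from the listing. The function name `recall_at_k` and its input layout (one ranked list of gallery labels per query, best match first) are hypothetical, and the exact evaluation protocol behind the numbers above is not stated here.

```python
def recall_at_k(query_labels, ranked_gallery_labels, k=1):
    """Percentage of queries with at least one same-class item in the top k.

    query_labels: class label of each query.
    ranked_gallery_labels: for each query, gallery labels sorted from most
    to least similar (assumed to exclude the query itself).
    """
    hits = sum(
        1
        for label, ranked in zip(query_labels, ranked_gallery_labels)
        if label in ranked[:k]
    )
    return 100.0 * hits / len(query_labels)


# Toy example: four queries, two ranked gallery labels each.
queries = ["a", "b", "c", "d"]
rankings = [["a", "b"], ["c", "b"], ["c", "a"], ["a", "d"]]
r1 = recall_at_k(queries, rankings, k=1)
r2 = recall_at_k(queries, rankings, k=2)
```

Here two of the four queries have a correct top-1 match, and all four have a correct match within the top 2.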

