Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs

About

Although Multimodal Large Language Models (MLLMs) show promising results on general zero-shot image classification, fine-grained image classification remains challenging. It demands precise attention to the subtle visual details that distinguish visually similar subcategories, details that MLLMs may easily overlook without explicit guidance. To address this, we introduce AutoSEP, an iterative self-supervised prompt learning framework designed to enhance the fine-grained classification capabilities of MLLMs in a fully unsupervised manner. Our core idea is to leverage unlabeled data to learn a description prompt that guides MLLMs in identifying crucial discriminative features within an image, thereby boosting classification accuracy. AutoSEP iteratively improves the description prompt using unlabeled data, based on an instance-level classification scoring function. It requires only black-box access to MLLMs, eliminating the need for any training or fine-tuning. We evaluate our approach on multiple fine-grained classification datasets, where it consistently outperforms other unsupervised baselines, demonstrating the effectiveness of our self-supervised optimization framework. Notably, AutoSEP yields an average improvement of 13 percent over standard zero-shot classification and 5 percent over the best-performing baselines. Code is available at: https://github.com/yq-hong/AutoSEP

Yunqi Hong, Sohyun An, Andrew Bai, Neil Y.C. Lin, Cho-Jui Hsieh • 2025
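
The abstract describes an iterative loop that refines a description prompt using only unlabeled images and an instance-level score. The following is a minimal sketch of that general idea, not the authors' implementation (see their repository for that). It assumes a simple greedy search, and the names `optimize_prompt`, `propose`, and `instance_score` are hypothetical: `propose` stands in for whatever black-box MLLM call generates candidate prompts, and `instance_score` stands in for a per-image scoring function.

```python
from typing import Callable, List, Sequence, Tuple


def optimize_prompt(
    initial_prompt: str,
    unlabeled_images: Sequence[object],
    propose: Callable[[str], List[str]],            # hypothetical: returns candidate prompts
    instance_score: Callable[[str, object], float],  # hypothetical: scores one prompt on one image
    iterations: int = 3,
) -> Tuple[str, float]:
    """Greedy search over description prompts using only unlabeled images."""
    if not unlabeled_images:
        raise ValueError("at least one unlabeled image is required")

    def mean_score(prompt: str) -> float:
        # Average the per-image (instance-level) score over the unlabeled pool.
        total = sum(instance_score(prompt, img) for img in unlabeled_images)
        return total / len(unlabeled_images)

    best_prompt, best_score = initial_prompt, mean_score(initial_prompt)
    for _ in range(iterations):
        improved = False
        # Candidates are generated once per round from the current best prompt.
        for candidate in propose(best_prompt):
            score = mean_score(candidate)
            if score > best_score:
                best_prompt, best_score, improved = candidate, score, True
        if not improved:
            break  # no candidate beat the current prompt; stop early
    return best_prompt, best_score
```

In practice both callables would wrap calls to an MLLM, so no gradients or model weights are needed, which matches the black-box setting the abstract describes. No labels appear anywhere in the loop; the score is computed from the unlabeled images alone.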

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | Stanford Cars | Accuracy 68.12 | 635 |
| Image Classification | CUB-200 2011 | Accuracy 63.19 | 356 |
| Image Classification | Oxford Flowers 102 | Accuracy 70.63 | 234 |
| Image Classification | Oxford-IIIT Pet | Accuracy 86.49 | 219 |
| Image Classification | Stanford Dogs | Accuracy 67.82 | 153 |
| Image Classification | FGVC Aircraft | -- | 92 |
| Scene recognition | SUN397 | Accuracy 71.92 | 49 |
| Recognition | ImageNet-1K | Top-1 Accuracy 71.45 | 42 |
| Image Recognition | Describable Textures Dataset (DTD) | Accuracy 61.47 | 32 |
| Visual Recognition | Food-101 | Top-1 Acc 84.75 | 16 |
Showing 10 of 12 rows
