Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs
About
Although Multimodal Large Language Models (MLLMs) show promising results on general zero-shot image classification, fine-grained image classification remains challenging. It demands precise attention to the subtle visual details that distinguish visually similar subcategories, details that MLLMs may easily overlook without explicit guidance. To address this, we introduce AutoSEP, an automatic self-enhancing prompt learning framework that improves the fine-grained classification capabilities of MLLMs in a fully unsupervised manner. Our core idea is to leverage unlabeled data to learn a description prompt that guides the MLLM to identify the crucial discriminative features within an image, which in turn boosts classification accuracy. AutoSEP iteratively improves this description prompt using unlabeled data, guided by an instance-level classification scoring function. It requires only black-box access to MLLMs, eliminating the need for any training or fine-tuning. We evaluate our approach on multiple fine-grained classification datasets, where it consistently outperforms other unsupervised baselines, demonstrating the effectiveness of our self-supervised optimization framework. Notably, AutoSEP improves accuracy by 13 percent on average over standard zero-shot classification and by 5 percent over the best-performing baselines. Code is available at: https://github.com/yq-hong/AutoSEP
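The iterative, label-free prompt refinement described above can be illustrated with a minimal sketch. This is not the AutoSEP implementation: the beam-style search, the function names (`optimize_prompt`, `toy_propose`, `toy_score`), and the toy scoring rule are all assumptions made for illustration. In a real setting, `propose` and `score` would each wrap black-box MLLM calls, and `score` would play the role of the instance-level classification scoring function computed on unlabeled images.

```python
from typing import Callable, List, Sequence


def optimize_prompt(
    seed_prompt: str,
    unlabeled_images: Sequence[str],
    propose: Callable[[str], List[str]],
    score: Callable[[str, Sequence[str]], float],
    iterations: int = 3,
    beam_size: int = 2,
) -> str:
    """Search over description prompts using only a black-box score (no labels, no gradients)."""
    beam = [seed_prompt]
    for _ in range(iterations):
        # Expand: keep the current prompts and add proposed rewrites of each one.
        expanded = beam + [c for p in beam for c in propose(p)]
        candidates = list(dict.fromkeys(expanded))  # de-duplicate, keep order
        # Select: rank candidates by their score on the unlabeled images.
        candidates.sort(key=lambda p: score(p, unlabeled_images), reverse=True)
        beam = candidates[:beam_size]
    return beam[0]


# --- Hypothetical stand-ins so the sketch runs without any model access ---
FEATURES = ["beak shape", "wing pattern", "tail length"]


def toy_propose(prompt: str) -> List[str]:
    # Stand-in for asking an MLLM to rewrite the description prompt.
    return [f"{prompt} Note the {f}." for f in FEATURES if f not in prompt]


def toy_score(prompt: str, images: Sequence[str]) -> float:
    # Stand-in for an instance-level score; this toy version ignores the
    # images and simply counts how many discriminative features are named.
    return float(sum(f in prompt for f in FEATURES))


best = optimize_prompt(
    "Describe the bird.", ["img1.jpg", "img2.jpg"], toy_propose, toy_score
)
print(best)
```

Passing `propose` and `score` as callables keeps the loop independent of any particular model API, which matches the black-box setting: only the two stand-in functions would need to change to plug in real MLLM queries.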
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | Stanford Cars | Accuracy | 68.12 | 635 |
| Image Classification | CUB-200 2011 | Accuracy | 63.19 | 356 |
| Image Classification | Oxford Flowers 102 | Accuracy | 70.63 | 234 |
| Image Classification | Oxford-IIIT Pet | Accuracy | 86.49 | 219 |
| Image Classification | Stanford Dogs | Accuracy | 67.82 | 153 |
| Image Classification | FGVC Aircraft | -- | -- | 92 |
| Scene recognition | SUN397 | Accuracy | 71.92 | 49 |
| Recognition | ImageNet-1K | Top-1 Accuracy | 71.45 | 42 |
| Image Recognition | Describable Textures Dataset (DTD) | Accuracy | 61.47 | 32 |
| Visual Recognition | Food-101 | Top-1 Accuracy | 84.75 | 16 |