Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models
About
Fine-grained image classification, particularly in zero/few-shot scenarios, presents a significant challenge for vision-language models (VLMs) such as CLIP. These models often struggle to distinguish between semantically similar classes because their pre-training recipe lacks supervision signals for fine-grained categorization. This paper introduces CascadeVLM, a framework that overcomes the constraints of previous CLIP-based methods by leveraging the granular knowledge encapsulated within large vision-language models (LVLMs). Experiments across various fine-grained image datasets demonstrate that CascadeVLM significantly outperforms existing models; on the Stanford Cars dataset in particular, it achieves 85.6% zero-shot accuracy. Performance gain analysis validates that LVLMs produce more accurate predictions for the challenging images that CLIP is uncertain about, which accounts for the overall accuracy boost. Our framework sheds light on a holistic integration of VLMs and LVLMs for effective and efficient fine-grained image classification.
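The abstract describes a cascade in which CLIP handles the images it is confident about and an LVLM handles the rest, but it does not spell out the routing rule. The sketch below is one plausible reading, not the authors' implementation: it assumes that uncertainty is measured by the entropy of CLIP's softmax distribution, and that the LVLM is asked to pick among CLIP's top-k candidates. The names `cascade_predict`, `lvlm_choose`, `entropy_threshold`, and `top_k` are hypothetical, as are their default values.

```python
import math
from typing import Callable, Sequence


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cascade_predict(
    clip_logits: Sequence[float],
    class_names: Sequence[str],
    lvlm_choose: Callable[[list[str]], str],
    entropy_threshold: float = 0.5,  # assumed value, not from the paper
    top_k: int = 3,                  # assumed value, not from the paper
) -> str:
    """Return CLIP's top-1 class when it is confident, else defer to the LVLM.

    `lvlm_choose` is a hypothetical callable standing in for an LVLM query:
    it receives a shortlist of candidate class names and returns one of them.
    """
    probs = softmax(clip_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Low entropy: CLIP is confident, so the expensive LVLM call is skipped.
    if entropy(probs) <= entropy_threshold:
        return class_names[ranked[0]]

    # High entropy: hand CLIP's top-k candidates to the LVLM for a final pick.
    candidates = [class_names[i] for i in ranked[:top_k]]
    return lvlm_choose(candidates)
```

Under these assumptions, the threshold trades accuracy against cost: a lower `entropy_threshold` sends more images to the LVLM, while a higher one keeps more decisions with CLIP.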
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | Stanford Cars | Accuracy | 79.26 | 635 |
| Image Classification | CUB-200 2011 | Accuracy | 60.26 | 356 |
| Image Classification | Oxford Flowers 102 | Accuracy | 74.23 | 234 |
| Image Classification | Oxford-IIIT Pet | Accuracy | 86.17 | 219 |
| Image Classification | Stanford Dogs | Accuracy | 64.54 | 153 |
| Image Classification | FGVC Aircraft | -- | -- | 92 |
| Scene recognition | SUN397 | Accuracy | 69.96 | 49 |
| Recognition | ImageNet-1K | Top-1 Accuracy | 73.98 | 42 |
| Image Recognition | Describable Textures Dataset (DTD) | Accuracy | 53.24 | 32 |
| Visual Recognition | Food-101 | Top-1 Accuracy | 84.51 | 16 |