Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models

About

Fine-grained image classification, particularly in zero/few-shot scenarios, presents a significant challenge for vision-language models (VLMs), such as CLIP. These models often struggle with the nuanced task of distinguishing between semantically similar classes due to limitations in their pre-trained recipe, which lacks supervision signals for fine-grained categorization. This paper introduces CascadeVLM, an innovative framework that overcomes the constraints of previous CLIP-based methods by effectively leveraging the granular knowledge encapsulated within large vision-language models (LVLMs). Experiments across various fine-grained image datasets demonstrate that CascadeVLM significantly outperforms existing models, specifically on the Stanford Cars dataset, achieving an impressive 85.6% zero-shot accuracy. Performance gain analysis validates that LVLMs produce more accurate predictions for challenging images that CLIPs are uncertain about, bringing the overall accuracy boost. Our framework sheds light on a holistic integration of VLMs and LVLMs for effective and efficient fine-grained image classification.

Canshi Wei• 2024

Related benchmarks

Task	Dataset	Result
Image Classification	Stanford Cars	Accuracy79.26	660
Image Classification	CUB-200 2011	Accuracy60.26	374
Image Classification	Oxford Flowers 102	Accuracy74.23	234
Image Classification	Oxford-IIIT Pet	Accuracy86.17	219
Image Classification	Stanford Dogs	Accuracy64.54	153
Image Classification	FGVC Aircraft	--	112
Scene recognition	SUN397	Accuracy69.96	49
Recognition	ImageNet-1K	Top-1 Accuracy73.98	42
Image Recognition	Describable Textures Dataset (DTD)	Accuracy53.24	32
Visual Recognition	Food-101	Top-1 Acc84.51	16

Showing 10 of 12 rows

Other info

Follow for update

@wizwand_team Discord