SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition
About
Recent advances in Large Vision-Language Models (LVLMs) have enabled training-free Fine-Grained Visual Recognition (FGVR). However, effectively exploiting LVLMs for FGVR remains challenging due to the inherent visual ambiguity of subordinate-level categories. Existing methods predominantly adopt either retrieval-oriented or reasoning-oriented paradigms to tackle this challenge, but both are constrained by two fundamental limitations: (1) they apply the same inference pipeline to all samples without accounting for uneven recognition difficulty, which leads to suboptimal accuracy and efficiency; and (2) they lack mechanisms to consolidate and reuse error-specific experience, which causes repeated failures on similar challenging cases. To address these limitations, we propose SARE, a Sample-wise Adaptive REasoning framework for training-free FGVR. Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. In the reasoning process, SARE incorporates a self-reflective experience mechanism that leverages past failures to provide transferable discriminative guidance during inference, without any parameter updates. Extensive experiments across 14 datasets substantiate that SARE achieves state-of-the-art performance while substantially reducing computational overhead.
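The cascade described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how SARE decides when reasoning is "necessary" or how experience is stored, so the score-margin gate, the `margin_threshold` value, the plain-text experience notes, and every name below (`CascadeRecognizer`, `toy_retrieve`, `toy_reason`) are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# (label, score) pairs produced by the fast retrieval stage.
Candidates = List[Tuple[str, float]]


@dataclass
class CascadeRecognizer:
    """Illustrative two-stage cascade; the gating rule is an assumption."""
    retrieve: Callable[[object], Candidates]              # fast stage
    reason: Callable[[object, List[str], List[str]], str]  # slow stage
    margin_threshold: float = 0.1                          # hypothetical gate
    experience: List[str] = field(default_factory=list)    # notes from past failures

    def predict(self, image) -> Tuple[str, bool]:
        """Return (label, used_reasoning)."""
        candidates = sorted(self.retrieve(image), key=lambda c: c[1], reverse=True)
        top_label, top_score = candidates[0]
        runner_up = candidates[1][1] if len(candidates) > 1 else float("-inf")
        # Assumed rule: a clear margin means the sample is easy, so stop early.
        if top_score - runner_up >= self.margin_threshold:
            return top_label, False
        # Ambiguous sample: invoke reasoning with the stored experience notes.
        labels = [label for label, _ in candidates]
        return self.reason(image, labels, list(self.experience)), True

    def reflect(self, note: str) -> None:
        """Store a discriminative hint derived from a past mistake (no weight updates)."""
        self.experience.append(note)


# Toy stand-ins for the retrieval model and the LVLM reasoning call.
TOY_SCORES = {
    "easy": [("sparrow", 0.9), ("finch", 0.3)],
    "hard": [("crow", 0.52), ("raven", 0.50)],
}


def toy_retrieve(image) -> Candidates:
    return TOY_SCORES[image]


def toy_reason(image, labels: List[str], notes: List[str]) -> str:
    # Prefer the runner-up only if some experience note mentions it.
    if len(labels) > 1 and any(labels[1] in note for note in notes):
        return labels[1]
    return labels[0]
```

In this toy setup, the "easy" sample is resolved by retrieval alone, while the "hard" sample falls through to the reasoning stage; after a note is added with `reflect`, the same hard sample can receive a different answer without any parameter change, which mirrors the role the abstract assigns to the experience mechanism.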
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | Stanford Cars | Accuracy 99.34 | 635 |
| Image Classification | CUB-200 2011 | Accuracy 90.76 | 356 |
| Image Classification | Oxford Flowers 102 | Accuracy 88.31 | 234 |
| Image Classification | Oxford-IIIT Pet | Accuracy 95.38 | 219 |
| Image Classification | Stanford Dogs | Accuracy 84.29 | 153 |
| Image Classification | FGVC Aircraft | -- | 92 |
| Scene recognition | SUN397 | Accuracy 80.05 | 49 |
| Recognition | ImageNet-1K | Top-1 Accuracy 85.06 | 42 |
| Image Recognition | Describable Textures Dataset (DTD) | Accuracy 83.1 | 32 |
| Visual Recognition | Food-101 | Top-1 Acc 88.02 | 16 |