SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition

About

Recent advances in Large Vision-Language Models (LVLMs) have enabled training-free Fine-Grained Visual Recognition (FGVR). However, effectively exploiting LVLMs for FGVR remains challenging due to the inherent visual ambiguity of subordinate-level categories. Existing methods predominantly adopt either retrieval-oriented or reasoning-oriented paradigms to tackle this challenge, but both are constrained by two fundamental limitations: (1) They apply the same inference pipeline to all samples without accounting for uneven recognition difficulty, thereby leading to suboptimal accuracy and efficiency; (2) The lack of mechanisms to consolidate and reuse error-specific experience causes repeated failures on similar challenging cases. To address these limitations, we propose SARE, a Sample-wise Adaptive REasoning framework for training-free FGVR. Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. In the reasoning process, SARE incorporates a self-reflective experience mechanism that leverages past failures to provide transferable discriminative guidance during inference, without any parameter updates. Extensive experiments across 14 datasets substantiate that SARE achieves state-of-the-art performance while substantially reducing computational overhead.
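
The cascaded design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how SARE decides that reasoning is "necessary", so the gating rule used here (the gap between the top two retrieval scores, compared against a `margin` threshold) is an assumption, and `cascade_predict`, `reason_fn`, `experience`, and `toy_reasoner` are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, List, Tuple


def cascade_predict(
    scores: Dict[str, float],
    reason_fn: Callable[[List[str], List[str]], str],
    experience: List[str],
    margin: float = 0.2,
    top_k: int = 3,
) -> Tuple[str, bool]:
    """Return (label, used_reasoning) for one sample.

    scores:     retrieval score per candidate label (fast stage output).
    reason_fn:  hypothetical slow stage, e.g. a wrapper around an LVLM call.
    experience: notes distilled from past failures, passed to the slow stage.
    margin:     assumed gating threshold; not taken from the paper.
    """
    if not scores:
        raise ValueError("scores must not be empty")

    # Rank candidates from highest to lowest retrieval score.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")

    # Easy sample: the top candidate clearly wins, so skip reasoning.
    if best_score - runner_up >= margin:
        return best_label, False

    # Hard sample: hand the top-k candidates and past notes to the slow stage.
    candidates = [label for label, _ in ranked[:top_k]]
    return reason_fn(candidates, experience), True


def toy_reasoner(candidates: List[str], notes: List[str]) -> str:
    # Stand-in for a real reasoning model: always picks the second candidate,
    # only to show that the slow stage may overturn the retrieval ranking.
    return candidates[1] if len(candidates) > 1 else candidates[0]


easy = cascade_predict({"sparrow": 0.9, "finch": 0.3}, toy_reasoner, [])
hard = cascade_predict(
    {"sparrow": 0.52, "finch": 0.48, "wren": 0.1},
    toy_reasoner,
    ["check beak shape"],
)
print(easy)  # ('sparrow', False)
print(hard)  # ('finch', True)
```

In this sketch the efficiency gain comes from the early return: only samples whose retrieval scores are close reach `reason_fn`. The `experience` list is simply passed through, which mirrors the idea of reusing past failures at inference time without updating any parameters.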

Jingxiao Yang, DaLin He, Miao Pan, Ge Su, Wenqi Zhang, Yifeng Hu, Tangwei Li, Yuke Li, Xuhong Zhang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | Stanford Cars | Accuracy 99.34 | 635 |
| Image Classification | CUB-200 2011 | Accuracy 90.76 | 356 |
| Image Classification | Oxford Flowers 102 | Accuracy 88.31 | 234 |
| Image Classification | Oxford-IIIT Pet | Accuracy 95.38 | 219 |
| Image Classification | Stanford Dogs | Accuracy 84.29 | 153 |
| Image Classification | FGVC Aircraft | -- | 92 |
| Scene recognition | SUN397 | Accuracy 80.05 | 49 |
| Recognition | ImageNet-1K | Top-1 Accuracy 85.06 | 42 |
| Image Recognition | Describable Textures Dataset (DTD) | Accuracy 83.1 | 32 |
| Visual Recognition | Food-101 | Top-1 Acc 88.02 | 16 |
Showing 10 of 12 rows
