SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition

About

Recent advances in Large Vision-Language Models (LVLMs) have enabled training-free Fine-Grained Visual Recognition (FGVR). However, effectively exploiting LVLMs for FGVR remains challenging due to the inherent visual ambiguity of subordinate-level categories. Existing methods predominantly adopt either retrieval-oriented or reasoning-oriented paradigms to tackle this challenge, but both are constrained by two fundamental limitations: (1) they apply the same inference pipeline to all samples without accounting for uneven recognition difficulty, which leads to suboptimal accuracy and efficiency; (2) they lack mechanisms to consolidate and reuse error-specific experience, which causes repeated failures on similar challenging cases. To address these limitations, we propose SARE, a Sample-wise Adaptive REasoning framework for training-free FGVR. Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. In the reasoning process, SARE incorporates a self-reflective experience mechanism that leverages past failures to provide transferable discriminative guidance during inference, without any parameter updates. Extensive experiments across 14 datasets substantiate that SARE achieves state-of-the-art performance while substantially reducing computational overhead.
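The cascaded design described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how SARE decides when reasoning is "necessary" or how experience is stored, so the score-margin gate, the `margin` value, the pairwise experience store, and every name below (`CascadeRecognizer`, `retrieve`, `reason`, `record_failure`) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical type: (label, similarity score) pairs, sorted by score descending.
Candidates = List[Tuple[str, float]]


@dataclass
class CascadeRecognizer:
    """Illustrative cascade: fast retrieval first, slow reasoning only if needed."""

    retrieve: Callable[[str], Candidates]                # fast stage (assumed interface)
    reason: Callable[[str, List[str], List[str]], str]   # slow stage: (image, labels, hints) -> label
    margin: float = 0.1                                  # assumed gating threshold
    # Assumed experience store: confused label pair -> notes written after failures.
    experience: Dict[frozenset, List[str]] = field(default_factory=dict)

    def predict(self, image: str) -> Tuple[str, bool]:
        """Return (label, used_reasoning) for one sample."""
        cands = self.retrieve(image)
        if not cands:
            raise ValueError("retrieval returned no candidates")
        top_label, top_score = cands[0]
        runner_up = cands[1][1] if len(cands) > 1 else float("-inf")
        # Easy sample: the top candidate clearly beats the runner-up.
        if top_score - runner_up >= self.margin:
            return top_label, False
        # Hard sample: hand the candidate labels and any stored hints to the reasoner.
        labels = [label for label, _ in cands]
        return self.reason(image, labels, self.hints_for(labels)), True

    def hints_for(self, labels: List[str]) -> List[str]:
        """Collect notes whose confused pair is fully contained in the candidates."""
        present = set(labels)
        hints: List[str] = []
        for pair, notes in self.experience.items():
            if pair <= present:
                hints.extend(notes)
        return hints

    def record_failure(self, predicted: str, correct: str, note: str) -> None:
        """Store a discriminative note for a confused pair; no model weights change."""
        self.experience.setdefault(frozenset((predicted, correct)), []).append(note)
```

In this sketch, `record_failure` presumes that some labeled examples are available to detect a wrong prediction; the abstract does not specify where SARE's past failures come from.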

Jingxiao Yang, DaLin He, Miao Pan, Kaixiang Yao, Ge Su, Wenqi Zhang, Yifeng Hu, Tangwei Li, Yuke Li, Xuhong Zhang • 2026

Related benchmarks

Task	Dataset	Result
Image Classification	Stanford Cars	Accuracy 99.34	660
Image Classification	CUB-200 2011	Accuracy 90.76	374
Image Classification	Oxford Flowers 102	Accuracy 88.31	234
Image Classification	Oxford-IIIT Pet	Accuracy 95.38	219
Image Classification	Stanford Dogs	Accuracy 84.29	153
Image Classification	FGVC Aircraft	--	112
Scene recognition	SUN397	Accuracy 80.05	49
Recognition	ImageNet-1K	Top-1 Accuracy 85.06	42
Image Recognition	Describable Textures Dataset (DTD)	Accuracy 83.1	32
Visual Recognition	Food-101	Top-1 Acc 88.02	16

Showing 10 of 12 rows
