DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition
About
Large Vision Language Models (LVLMs) possess extensive text knowledge but struggle to use it for fine-grained image recognition, often failing to differentiate between visually similar categories. Existing fine-tuning methods that use Reinforcement Learning (RL) with exact-match reward signals are often brittle, encourage memorization of training categories, and fail to elicit the differential reasoning needed to generalize to unseen classes. To address this, we propose $\textbf{DiVE-k}$, $\textbf{Di}$fferential $\textbf{V}$isual r$\textbf{E}$asoning using top-$\textbf{k}$ generations, a framework that leverages the model's own top-k predictions as a training signal. For each training image, DiVE-k creates a multiple-choice question from the model's top-k outputs and uses RL to train the model to select the correct answer. This approach requires the model to perform fine-grained differential reasoning among plausible options and provides a simple, verifiable reward signal that mitigates memorization and improves generalization. Experiments on five standard fine-grained datasets show that our method significantly outperforms existing approaches. In the standard base-to-novel generalization setting, DiVE-k surpasses QWEN2.5-VL-7B and ViRFT by 10.04% and 6.16% on the Harmonic Mean metric, respectively. Further experiments show similar gains in mixed-domain and few-shot scenarios. Our code is available $\href{https://github.com/raja-kumar/DiVE-k}{here}$.
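The multiple-choice construction and the verifiable reward described above can be illustrated with a minimal Python sketch. It is not the released implementation: the helper names `build_mcq` and `mcq_reward`, the question wording, the binary 0/1 reward, and the choice to append the ground-truth label when it is missing from the top-k outputs are all assumptions made for illustration.

```python
import random
import string


def build_mcq(top_k_predictions, ground_truth, seed=None):
    """Turn a model's top-k label guesses into a multiple-choice question.

    Hypothetical helper: returns (question_text, correct_letter).
    """
    # Drop duplicate guesses while keeping their original order.
    options = list(dict.fromkeys(top_k_predictions))
    # Assumption: if the true label is absent from the top-k guesses,
    # add it so the question always has a correct option.
    if ground_truth not in options:
        options.append(ground_truth)
    # Shuffle so the correct option's position carries no information.
    rng = random.Random(seed)
    rng.shuffle(options)
    letters = string.ascii_uppercase[: len(options)]
    lines = [f"{letter}. {label}" for letter, label in zip(letters, options)]
    question = "Which category best describes the image?\n" + "\n".join(lines)
    correct_letter = letters[options.index(ground_truth)]
    return question, correct_letter


def mcq_reward(model_answer, correct_letter):
    """Assumed binary reward: 1.0 if the chosen letter is correct, else 0.0."""
    return 1.0 if model_answer.strip().upper() == correct_letter else 0.0


if __name__ == "__main__":
    guesses = ["Blue Jay", "Indigo Bunting", "Blue Jay", "Blue Grosbeak"]
    question, answer = build_mcq(guesses, "Indigo Bunting", seed=0)
    print(question)
    print("reward for correct choice:", mcq_reward(answer, answer))
```

Because the reward only checks a single option letter, it stays easy to verify, while the distractors come from the model's own plausible confusions rather than from a fixed label list.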
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Fine-grained Image Classification | CUB-200 | Accuracy (All) | 76.8 | 39 |
| Fine-grained Image Classification | Oxford Flowers 102 | Accuracy | 88.7 | 33 |
| Fine-grained Image Classification | Stanford Cars | Base Accuracy | 69 | 27 |
| Fine-grained visual classification | Oxford-IIIT Pet (test) | -- | -- | 10 |
| Fine-grained Image Classification | FGVC Aircraft 100 | Accuracy | 69.1 | 7 |
| Fine-grained Image Classification | FGVC-Aircraft (test) | Base Accuracy | 68.1 | 7 |
| Fine-grained Image Classification | Average (5 datasets) Macro-average (test) | Base Accuracy | 80.8 | 7 |
| Fine-grained Image Classification | Flowers | Base Accuracy | 97.4 | 7 |
| Fine-grained Image Classification | Oxford Flowers-102 (test) | Base Accuracy (B) | 97.4 | 7 |
| Fine-grained Image Classification | Aircraft | Base Accuracy | 65.5 | 7 |