ToolFG: Towards Well-Grounded Fine-Grained Image Classification
About
Fine-grained image classification (FGIC) has broad applications and has attracted significant research attention. In this paper, we explore a novel paradigm for solving FGIC by proposing ToolFG, the first tool-integrated MLLM-based framework tailored to FGIC. ToolFG enables MLLMs to autonomously and flexibly use external tools during the reasoning process, actively interact with images, and collect verifiable visual cues for distinguishing highly similar categories in a more reliable and well-grounded manner. To equip the model with such tool-use ability, we design a novel MCTS-guided tool-use knowledge distillation mechanism, which effectively mines tool-use- and FGIC-relevant knowledge from advanced proprietary MLLMs for model training. Furthermore, we propose a model-tool co-evolution mechanism that jointly refines the toolset and the model's tool-use policy, driving them toward a mutually adapted and FGIC-specialized state. Extensive experiments demonstrate the effectiveness of our framework.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Fine-grained Image Classification | CUB-200 | Accuracy (All) | 83 | 39 |
| Fine-grained Image Classification | Oxford Flowers 102 | Accuracy | 95.8 | 33 |
| Fine-grained Image Classification | Stanford Cars | Base Accuracy | 90 | 27 |
| Fine-grained visual classification | Oxford-IIIT Pet (test) | -- | -- | 10 |
| Fine-grained Image Classification | Oxford Pets-37 | Accuracy | 97.3 | 7 |
| Fine-grained Image Classification | Stanford Cars 196 | Accuracy | 83.4 | 7 |
| Fine-grained Image Classification | FGVC Aircraft 100 | Accuracy | 76.6 | 7 |
| Fine-grained Image Classification | Oxford Flowers-102 (test) | Base Accuracy (B) | 99.3 | 7 |
| Fine-grained Image Classification | FGVC-Aircraft (test) | Base Accuracy | 75.6 | 7 |
| Fine-grained Image Classification | Average (5 datasets) Macro-average (test) | Base Accuracy | 89.3 | 7 |