Surely Large Multimodal Models (Don't) Excel in Visual Species Recognition?

About

Visual Species Recognition (VSR) is pivotal to biodiversity assessment and conservation, evolution research, and ecology and ecosystem management. Training a machine-learned model for VSR typically requires vast amounts of annotated images. Yet, species-level annotation demands domain expertise, making it realistic for domain experts to annotate only a few examples. These limited labeled data motivate training an "expert" model via few-shot learning (FSL). Meanwhile, advanced Large Multimodal Models (LMMs) have demonstrated prominent performance on general recognition tasks. It is straightforward to ask whether LMMs excel in the highly specialized VSR task and whether they outshine FSL expert models. Somewhat surprisingly, we find that LMMs struggle in this task, despite using various established prompting techniques. LMMs even significantly underperform FSL expert models, which are as simple as finetuning a pretrained visual encoder on the few-shot images. However, our in-depth analysis reveals that LMMs can effectively post-hoc correct the expert models' incorrect predictions. Briefly, given a test image, when prompted with the top predictions from an FSL expert model, LMMs can recover the ground-truth label. Building on this insight, we derive a simple method called Post-hoc Correction (POC), which prompts an LMM to re-rank the expert model's top predictions using enriched prompts that include softmax confidence scores and few-shot visual examples. Across five challenging VSR benchmarks, POC outperforms prior art of FSL by +6.4% in accuracy without extra training, validation, or manual intervention. Importantly, POC generalizes to different pretrained backbones and LMMs, serving as a plug-and-play module to significantly enhance existing FSL methods.
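The re-ranking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the prompt wording, the choice of k, and the fallback rule are all assumptions made for illustration. It covers only the text part of the prompt (top predictions with softmax confidences); attaching the few-shot visual examples and calling an actual LMM are left out.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw classifier scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_candidates(logits, class_names, k=5):
    """Return the expert model's k most confident (name, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(zip(class_names, probs), key=lambda p: p[1], reverse=True)
    return ranked[:k]

def build_rerank_prompt(candidates):
    """Build a hypothetical text prompt asking an LMM to re-rank candidates.

    The wording is an assumption; the abstract does not give the prompt text.
    Few-shot example images would be attached separately to the LMM request.
    """
    lines = [
        "An expert classifier proposed these candidate species for the image,",
        "listed with its softmax confidence:",
    ]
    for i, (name, prob) in enumerate(candidates, start=1):
        lines.append(f"{i}. {name} (confidence {prob:.2f})")
    lines.append("Reply with the single most likely species name from the list.")
    return "\n".join(lines)

def parse_choice(reply, candidates):
    """Map the LMM reply back to a candidate name.

    Assumed fallback: keep the expert's top-1 if no candidate is mentioned.
    """
    reply_lower = reply.lower()
    for name, _ in candidates:
        if name.lower() in reply_lower:
            return name
    return candidates[0][0]

if __name__ == "__main__":
    # Made-up logits and class names, for demonstration only.
    names = ["Turdus merula", "Turdus philomelos", "Sturnus vulgaris"]
    cands = top_k_candidates([2.0, 1.0, 0.1], names, k=2)
    print(build_rerank_prompt(cands))
```

Falling back to the expert's top-1 prediction keeps the sketch from doing worse than the expert when the LMM reply cannot be parsed; whether the paper handles unparseable replies this way is not stated in the abstract.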

Tian Liu, Anwesha Basu, James Caverlee, Shu Kong • 2025

Related benchmarks

Task                                 Dataset                               Result         Rank
Few-shot Image Classification        Aves                                  Accuracy 69.4  22
Fine-grained species classification  iNaturalist Aves 16-shot 2018 (test)  Accuracy 69.4  18
Fine-grained species classification  Insecta Species196 16-shot (test)     Accuracy 70.8  18
Fine-grained species classification  Weeds Species196 16-shot (test)       Accuracy 87.7  18
Fine-grained species classification  Mollusca Species196 16-shot (test)    Accuracy 71.6  18
Fine-grained species classification  Fungi FungiTastic 16-shot (test)      Accuracy 31.1  18
Image Classification                 Fungi                                 --             18
Few-shot Image Classification        Fungi                                 Accuracy 15    8
Visual Species Recognition           Aves                                  --             6
Visual Species Recognition           Insecta                               --             6
Showing 10 of 15 rows
