Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters

About

Recent adapter-based CLIP tuning (e.g., Tip-Adapter) is a strong few-shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global uni-modal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to transfer this relational knowledge directly into the Tip-Adapter's key-value cache, effectively upgrading the prototypes while the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter, with zero extra latency or memory. Across standard 1-16-shot benchmarks, our method consistently sets a new state of the art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation. Code is available at https://github.com/MR-Sherif/TOGA.git.
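Since the abstract states that inference is identical to Tip-Adapter, a minimal sketch of Tip-Adapter-style cache inference may help make the "key-value cache" concrete. It is not taken from the TOGA repository. It assumes the commonly cited Tip-Adapter form, in which zero-shot text logits are combined with an exponentiated affinity between the query and the cached support keys. The function names, the toy 2-D features, and the `alpha`, `beta`, and `scale` values are illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def tip_adapter_logits(query, cache_keys, cache_values, text_weights,
                       alpha=1.0, beta=5.5, scale=100.0):
    """Tip-Adapter-style logits (assumed form, not the authors' code).

    query:        image feature of the test sample
    cache_keys:   support image features, one per cached shot
    cache_values: one-hot labels of the cached shots
    text_weights: one text embedding per class
    """
    q = l2_normalize(query)
    # Zero-shot term: scaled cosine similarity to each class text embedding.
    logits = [scale * dot(q, l2_normalize(w)) for w in text_weights]
    # Cache term: each support shot votes for its label, weighted by affinity.
    for key, value in zip(cache_keys, cache_values):
        affinity = math.exp(-beta * (1.0 - dot(q, l2_normalize(key))))
        for c in range(len(logits)):
            logits[c] += alpha * affinity * value[c]
    return logits


def predict(query, cache_keys, cache_values, text_weights):
    """Return the index of the highest-scoring class."""
    logits = tip_adapter_logits(query, cache_keys, cache_values, text_weights)
    return max(range(len(logits)), key=lambda c: logits[c])


# Toy example: two classes, one cached shot per class, 2-D features.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
print(predict([0.9, 0.1], keys, values, text))
```

Under this reading, training-only supervision changes only the contents of the cache keys, so the inference path above needs no extra parameters or computation at test time.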

Mohammed Rahman Sherif Khan Mohammad, Ardhendu Behera, Sandip Pradhan, Swagat Kumar, Amr Ahmed • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | Stanford Cars | Accuracy 85.3 | 635 |
| Image Classification | EuroSAT | Accuracy 89.4 | 569 |
| Image Classification | Flowers102 | Accuracy 98.3 | 558 |
| Image Classification | Food101 | Accuracy 87.5 | 457 |
| Image Classification | SUN397 | Accuracy 76.2 | 441 |
| Image Classification | Caltech101 | Accuracy 96.3 | 228 |
| Image Classification | ImageNet V2 (test) | -- | 216 |
| Image Classification | ImageNet-A (test) | -- | 175 |
| Image Classification | OxfordPets | Accuracy 93.4 | 160 |
| Image Classification | ImageNet-Sketch (test) | -- | 153 |
Showing 10 of 17 rows
