Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters

About

Recent adapter-based CLIP tuning (e.g., Tip-Adapter) is a strong few-shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global uni-modal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to distill this relational knowledge directly into the Tip-Adapter's key-value cache, effectively upgrading the prototypes; the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter, with zero extra latency or memory. Across standard 1- to 16-shot benchmarks, our method consistently establishes a new state of the art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation. Code is available at https://github.com/MR-Sherif/TOGA.git.
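Since the abstract states that inference is identical to Tip-Adapter, the test-time path can be illustrated with a minimal sketch of Tip-Adapter-style key-value cache matching. This is not the authors' code: the function names, the hyperparameters `alpha` and `beta`, the logit scale of 100, and the random stand-in features are all assumptions for illustration, and the training-only graph teacher is not shown.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each vector to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cache_logits(query, cache_keys, cache_values, text_weights,
                 alpha=1.0, beta=5.5, logit_scale=100.0):
    """Tip-Adapter-style logits (illustrative sketch, assumed hyperparameters).

    query:        (B, D) image features of the test batch
    cache_keys:   (N, D) cached support features (N = classes * shots)
    cache_values: (N, C) one-hot labels of the support samples
    text_weights: (C, D) class text embeddings
    """
    query = l2_normalize(query)
    cache_keys = l2_normalize(cache_keys)
    text_weights = l2_normalize(text_weights)

    # Cosine affinity between each query and each cached support feature.
    affinity = query @ cache_keys.T                                # (B, N)
    # Sharpened affinities vote for the labels stored in the cache.
    cache_term = np.exp(-beta * (1.0 - affinity)) @ cache_values   # (B, C)
    # Zero-shot term: similarity between the query and class text embeddings.
    zero_shot = logit_scale * (query @ text_weights.T)             # (B, C)
    return zero_shot + alpha * cache_term

# Toy usage with random stand-in features (not real CLIP embeddings).
rng = np.random.default_rng(0)
num_classes, shots, dim = 3, 2, 8
keys = rng.normal(size=(num_classes * shots, dim))
labels = np.repeat(np.arange(num_classes), shots)
values = np.eye(num_classes)[labels]
text = rng.normal(size=(num_classes, dim))
queries = rng.normal(size=(4, dim))
logits = cache_logits(queries, keys, values, text)
```

Under this reading, any training-time supervision only changes the contents of `cache_keys`; the function above, and therefore test-time latency and memory, stays the same.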

Mohammed Rahman Sherif Khan Mohammad, Ardhendu Behera, Sandip Pradhan, Swagat Kumar, Amr Ahmed • 2026

Related benchmarks

Task	Dataset	Result
Image Classification	Stanford Cars	Accuracy 85.3	660
Image Classification	EuroSAT	Accuracy 89.4	569
Image Classification	Flowers102	Accuracy 98.3	558
Image Classification	Food101	Accuracy 87.5	457
Image Classification	SUN397	Accuracy 76.2	450
Image Classification	OxfordPets	Accuracy 93.4	298
Image Classification	ImageNet V2 (test)	--	232
Image Classification	Caltech101	Accuracy 96.3	228
Image Classification	ImageNet-A (test)	--	177
Image Classification	ImageNet-R (test)	Accuracy 77.4	170

Showing 10 of 17 rows
