SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP
About
While Contrastive Language-Image Pretraining (CLIP) excels at zero-shot tasks by aligning image and text embeddings, its performance in few-shot classification is hindered by a critical limitation: intra-modal misalignment. This issue, caused by a persistent modality gap and CLIP's exclusively inter-modal training objective, leaves the embedding spaces uncalibrated, making direct image-to-image comparisons unreliable. Existing methods attempt to address this by refining similarity logits or by computationally expensive per-sample optimization. To overcome these challenges, we introduce SeMoBridge, a lightweight yet powerful approach that directly addresses the misalignment. Our method maps images into the text modality while keeping their semantic content intact, through what we call a Semantic Modality Bridge. SeMoBridge is closed-form and can optionally be trained through multi-modal supervision, combining image- and text-alignment losses to optimize the projection. Experiments show that the trained version, SeMoBridge-T, requires only a fraction of the training time of competing methods while overall outperforming them, particularly in low-data scenarios (1, 2, and 4 shots). The code is available at https://github.com/christti98/semobridge.
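To make the idea of a closed-form modality bridge concrete, here is a minimal sketch: a ridge-regression projection that maps few-shot image embeddings into the text-embedding space. This is an illustrative assumption, not the paper's exact formulation; the function name `fit_semantic_bridge`, the regularization strength, and the toy data are all hypothetical. See the repository linked above for the actual method.

```python
import numpy as np

def fit_semantic_bridge(img_emb, txt_emb, reg=1e-3):
    """Closed-form (ridge-regression) projection from image to text space.

    Solves W = argmin_W ||X W - T||^2 + reg * ||W||^2, which has the
    closed-form solution W = (X^T X + reg * I)^{-1} X^T T.
    Illustrative sketch only -- not the authors' exact bridge.
    """
    d = img_emb.shape[1]
    gram = img_emb.T @ img_emb + reg * np.eye(d)      # X^T X + reg * I
    return np.linalg.solve(gram, img_emb.T @ txt_emb)  # d x d projection

# Toy usage with random stand-ins for CLIP embeddings (hypothetical shapes).
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))  # few-shot image embeddings (n x d)
T = rng.standard_normal((16, 8))  # matching class text embeddings (n x d)
W = fit_semantic_bridge(X, T)
bridged = X @ W  # images mapped into the text modality
```

Once images live in the text space, image-to-image comparisons can be carried out there, sidestepping the intra-modal miscalibration the abstract describes; the optional trained variant would refine `W` with image- and text-alignment losses instead of stopping at the closed-form solution.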
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | ImageNet V2 | -- | 611 |
| Image Classification | ImageNet | Top-1 Accuracy: 73.98 | 80 |
| Few-shot Image Classification | Average of 11 datasets (test) | Average Accuracy (Few-shot): 78.15 | 47 |
| Image Classification | ImageNet-Sketch | Accuracy: 50.44 | 32 |
| Text-to-Text Retrieval | NLP Retrieval Benchmarks standard (test) | IMDB Retrieval Score: 57.42 | 4 |
| Image-to-Image Retrieval | CLIP Evaluation Suite (OxfordPets, Flowers102, FGVCAircraft, DTD, EuroSAT, StanfordCars, SUN397, Caltech101, UCF101; test) | OxfordPets Accuracy: 36.96 | 2 |