Mitigating the Modality Gap: Few-Shot Out-of-Distribution Detection with Multi-modal Prototypes and Image Bias Estimation

About

Existing vision-language model (VLM)-based methods for out-of-distribution (OOD) detection typically rely on similarity scores between input images and in-distribution (ID) text prototypes. However, the modality gap between image and text often results in high false positive rates, as OOD samples can exhibit high similarity to ID text prototypes. To mitigate the impact of this modality gap, we propose incorporating ID image prototypes along with ID text prototypes. We present theoretical analysis and empirical evidence indicating that this approach enhances VLM-based OOD detection performance without any additional training. To further reduce the gap between image and text, we introduce a novel few-shot tuning framework, SUPREME, comprising biased prompts generation (BPG) and image-text consistency (ITC) modules. BPG enhances image-text fusion and improves generalization by conditioning ID text prototypes on the Gaussian-based estimated image domain bias; ITC reduces the modality gap by minimizing intra- and inter-modal distances. Moreover, inspired by our theoretical and empirical findings, we introduce a novel OOD score $S_{\textit{GMP}}$, leveraging uni- and cross-modal similarities. Finally, we present extensive experiments to demonstrate that SUPREME consistently outperforms existing VLM-based OOD detection methods.
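The training-free idea in the abstract, scoring an input against ID image prototypes as well as ID text prototypes, can be illustrated with a minimal sketch. This is not the paper's $S_{\textit{GMP}}$ definition, which the abstract does not spell out. It assumes a max-softmax similarity score per modality, averaged with equal weight, and the names `image_prototypes`, `multimodal_score`, and `temperature` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def image_prototypes(feats, labels, num_classes):
    """Assumption: one prototype per ID class, the mean of its normalized
    few-shot image features. Every class needs at least one example."""
    feats = l2_normalize(feats)
    protos = np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])
    return l2_normalize(protos)


def multimodal_score(query, text_protos, img_protos, temperature=0.01):
    """Illustrative ID-ness score (higher means more in-distribution).

    Assumption: max softmax probability over classes, computed once against
    text prototypes and once against image prototypes, then averaged.
    """
    q = l2_normalize(query)
    s_text = softmax(q @ l2_normalize(text_protos).T / temperature).max(axis=-1)
    s_img = softmax(q @ l2_normalize(img_protos).T / temperature).max(axis=-1)
    return 0.5 * (s_text + s_img)
```

In this sketch, a query that aligns with a single class prototype scores close to 1, while a query that is equally similar to every class scores close to `1 / num_classes`. Thresholding the score then separates ID from OOD inputs.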

Yimu Wang, Evelien Riddell, Adrian Chow, Sean Sedwards, Krzysztof Czarnecki (2025)

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Out-of-Distribution Detection | SUN OOD with ImageNet-1k In-distribution (test) | FPR@95 | 19.4 | 204 |
| Out-of-Distribution Detection | ImageNet-1k ID iNaturalist OOD | FPR@95 | 8.27 | 132 |
| Out-of-Distribution Detection | ImageNet-1k Textures ID OOD | AUROC | 94.45 | 85 |
| Out-of-Distribution Detection | Places OOD ImageNet-1k ID | AUROC | 93.56 | 45 |
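For readers unfamiliar with the two metrics reported here, the following is a minimal sketch of how they are commonly computed from detector scores. It assumes the convention that a higher score means "more in-distribution". The function names are illustrative, and the functions return fractions in [0, 1], whereas the results above are percentages.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """False positive rate on OOD data at the threshold that keeps at least
    a `tpr` fraction of ID samples (FPR@95 when tpr=0.95). Lower is better."""
    ranked = sorted(id_scores, reverse=True)
    k = math.ceil(tpr * len(ranked))      # number of ID samples to retain
    threshold = ranked[k - 1]             # lowest score among those retained
    false_positives = sum(s >= threshold for s in ood_scores)
    return false_positives / len(ood_scores)


def auroc(id_scores, ood_scores):
    """Probability that a random ID sample scores higher than a random OOD
    sample, counting ties as one half. Higher is better.
    Quadratic-time pairwise form, chosen for clarity rather than speed."""
    wins = 0.0
    for a in id_scores:
        for b in ood_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))
```

A perfect detector yields `fpr_at_tpr(...) == 0.0` and `auroc(...) == 1.0`, while a detector that scores ID and OOD inputs identically yields an AUROC of 0.5.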
