
HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

About

Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, where visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the modalities -- spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: It processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while distinctively processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: Unlike standard expanding adapters, HeBA employs a compression bottleneck (D -> D/4) that explicitly forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: We challenge the restrictive zero-initialization paradigm, utilizing a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone's pre-trained knowledge. Extensive experiments demonstrate that HeBA's architecturally specialized design achieves superior stability and accuracy, establishing a new state-of-the-art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA.
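The three innovations described in the abstract can be sketched as a single adapter module. This is a minimal, hypothetical PyTorch sketch based only on the abstract's description — the class name, layer ordering, activation choice, and residual connection are assumptions, not the authors' released implementation (see the linked repository for that). It shows a depthwise-separable 2D convolution path for visual tokens, a dense linear path for text tokens, a compression bottleneck (D -> D/4), and Kaiming rather than zero initialization.

```python
import torch
import torch.nn as nn


class HeBAAdapter(nn.Module):
    """Hypothetical sketch of a heterogeneous bottleneck adapter.

    modality="visual": depthwise-separable 2D convs preserve spatial locality.
    modality="text":   dense linear projections capture semantic relationships.
    Both paths compress D -> D // reduction (the bottleneck regularizer).
    """

    def __init__(self, dim: int, reduction: int = 4, modality: str = "visual"):
        super().__init__()
        hidden = dim // reduction  # compression bottleneck: D -> D/4 by default
        if modality == "visual":
            self.down = nn.Sequential(
                # depthwise 3x3 conv: per-channel spatial mixing
                nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
                # pointwise 1x1 conv: channel compression into the bottleneck
                nn.Conv2d(dim, hidden, kernel_size=1),
            )
            self.up = nn.Conv2d(hidden, dim, kernel_size=1)
        elif modality == "text":
            self.down = nn.Linear(dim, hidden)
            self.up = nn.Linear(hidden, dim)
        else:
            raise ValueError(f"unknown modality: {modality}")
        self.act = nn.GELU()
        # "Active Gradient Initialization": Kaiming init instead of zero-init,
        # so the adapter contributes gradient signal from the first step.
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone's features intact;
        # the adapter only learns a bottlenecked correction.
        return x + self.up(self.act(self.down(x)))


# Usage: visual tokens as (B, D, H, W) feature maps, text tokens as (B, L, D).
visual_adapter = HeBAAdapter(dim=64, modality="visual")
text_adapter = HeBAAdapter(dim=64, modality="text")
vis_out = visual_adapter(torch.randn(2, 64, 7, 7))
txt_out = text_adapter(torch.randn(2, 16, 64))
```

Because the backbone stays frozen, only these small adapter parameters (roughly 2 * D * D/4 per path, plus the depthwise kernel) are trained, which is what makes the bottleneck act as a structural regularizer.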

Md Jahidul Islam • 2026

Related benchmarks

Task                          Dataset                                                     Result                       Rank
Image Classification          Food101                                                     --                           457
Image Classification          StanfordCars                                                --                           312
Image Classification          Caltech101                                                  Base Accuracy 98.41          148
Image Classification          EuroSAT                                                     Base Accuracy 95.43          104
Base-to-New Generalization    Avg over 11 datasets                                        Base Score 84.29             90
Action Recognition            UCF101                                                      Base Accuracy 85.73          75
Image Classification          Oxford Pets                                                 Base Accuracy 95.71          60
Image Classification          ImageNet source to 10 fine-grained target datasets (test)   Caltech101 Accuracy 94.81    30
Image Classification          FGVC Aircraft                                               Base Accuracy 42.38          25
Image Classification          SUN397                                                      Base Accuracy 81.9           25
Showing 10 of 13 rows

Other info

GitHub: https://github.com/Jahid12012021/VLM-HeBA
