
HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

About

Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, where visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the modalities: spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: it processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: unlike standard expanding adapters, HeBA employs a compression bottleneck (D -> D/4) that forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: we challenge the restrictive zero-initialization paradigm, using a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone's pre-trained knowledge. Extensive experiments demonstrate that HeBA's architecturally specialized design achieves superior stability and accuracy, establishing a new state of the art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA.
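To make the three design choices in the abstract concrete, here is a minimal PyTorch sketch of what a heterogeneous bottleneck adapter pair might look like. It is not taken from the linked repository. The class names, the GELU activation, the 3x3 kernel, the residual `scale` factor, the zeroed biases, and the assumption that visual tokens form a square patch grid with no class token are all illustrative guesses. Only the D -> D/4 compression, the depthwise-separable convolution for visual tokens, the dense linear path for text tokens, and the Kaiming initialization come from the abstract.

```python
import math

import torch
import torch.nn as nn


class VisualBottleneckAdapter(nn.Module):
    """Hypothetical visual branch: 1x1 compress -> depthwise 3x3 -> 1x1 expand."""

    def __init__(self, dim: int, reduction: int = 4, kernel_size: int = 3, scale: float = 0.1):
        super().__init__()
        hidden = dim // reduction  # compression bottleneck, D -> D/4 by default
        self.down = nn.Conv2d(dim, hidden, kernel_size=1)
        # groups=hidden makes this a depthwise conv; together with the 1x1 `up`
        # conv it forms a depthwise-separable convolution.
        self.depthwise = nn.Conv2d(
            hidden, hidden, kernel_size, padding=kernel_size // 2, groups=hidden
        )
        self.act = nn.GELU()  # assumption: activation is not named in the abstract
        self.up = nn.Conv2d(hidden, dim, kernel_size=1)
        self.scale = scale  # assumption: small residual scale to limit the initial perturbation
        for layer in (self.down, self.depthwise, self.up):
            nn.init.kaiming_normal_(layer.weight)  # non-zero ("active") initialization
            nn.init.zeros_(layer.bias)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim); assumes a square grid and no class token.
        b, n, d = tokens.shape
        side = math.isqrt(n)
        if side * side != n:
            raise ValueError("expected a square number of patch tokens")
        x = tokens.transpose(1, 2).reshape(b, d, side, side)  # back to a 2D feature map
        x = self.up(self.act(self.depthwise(self.down(x))))
        return tokens + self.scale * x.flatten(2).transpose(1, 2)


class TextBottleneckAdapter(nn.Module):
    """Hypothetical text branch: dense linear compress -> expand."""

    def __init__(self, dim: int, reduction: int = 4, scale: float = 0.1):
        super().__init__()
        hidden = dim // reduction
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)
        self.scale = scale
        for layer in (self.down, self.up):
            nn.init.kaiming_normal_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim)
        return tokens + self.scale * self.up(self.act(self.down(tokens)))


if __name__ == "__main__":
    visual = VisualBottleneckAdapter(dim=64)
    text = TextBottleneckAdapter(dim=64)
    print(visual(torch.randn(2, 49, 64)).shape)  # 7x7 patch grid
    print(text(torch.randn(2, 12, 64)).shape)
```

Both branches preserve the input shape, so in a setup like this they could be attached as residual side paths to a frozen backbone. Because the weights are not zero-initialized, the adapter output differs from its input from the first step, which is the behavior the "active gradient" point in the abstract describes.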

Md Jahidul Islam • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | Food101 | -- | 457 |
| Image Classification | StanfordCars | -- | 384 |
| Image Classification | Caltech101 | Base Accuracy 98.41 | 148 |
| Image Classification | EuroSAT | Base Accuracy 95.43 | 104 |
| Base-to-New Generalization | Avg over 11 datasets | Base Score 84.29 | 102 |
| Action Recognition | UCF101 | Base Accuracy 85.73 | 75 |
| Image Classification | Oxford Pets | Base Accuracy 95.71 | 60 |
| Image Classification | FGVC Aircraft | Base Accuracy 42.38 | 38 |
| Image Classification | ImageNet source to 10 fine-grained target datasets (test) | Caltech101 Accuracy 94.81 | 37 |
| Image Classification | SUN397 | Base Accuracy 81.9 | 25 |
Showing 10 of 13 rows
