HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models
About
Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, in which visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the two modalities: spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: it processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: unlike standard expanding adapters, HeBA employs a compression bottleneck (D -> D/4) that forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: we challenge the restrictive zero-initialization paradigm, using a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone's pre-trained knowledge. Extensive experiments demonstrate that HeBA's architecturally specialized design achieves superior stability and accuracy, establishing a new state of the art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA.
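To make the three design choices concrete, here is a minimal NumPy sketch of what a heterogeneous bottleneck adapter could look like. It is not the authors' implementation. The class names `VisualAdapter` and `TextAdapter`, the ReLU activation, the 3x3 kernel size, the residual connection, and the `scale` factor are all assumptions made for illustration; only the D -> D/4 bottleneck, the depthwise-separable convolution for visual tokens, the dense projections for text tokens, and the Kaiming initialization come from the abstract.

```python
import numpy as np


def kaiming_normal(rng, fan_in, shape):
    # Kaiming (He) normal init: std = sqrt(2 / fan_in), so weights start non-zero.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)


def depthwise_conv3x3(x, kernels):
    # x: (H, W, C) feature map; kernels: (3, 3, C), one 3x3 filter per channel.
    # Zero padding keeps the spatial size unchanged.
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += padded[i:i + h, j:j + w, :] * kernels[i, j, :]
    return out


class VisualAdapter:
    """Hypothetical visual branch: project down, depthwise 3x3, pointwise up."""

    def __init__(self, dim, rng, reduction=4, scale=0.1):
        hidden = dim // reduction  # compression bottleneck D -> D/4
        self.down = kaiming_normal(rng, dim, (dim, hidden))
        self.depthwise = kaiming_normal(rng, 9, (3, 3, hidden))
        self.up = kaiming_normal(rng, hidden, (hidden, dim))  # pointwise (1x1) step
        self.scale = scale  # assumed residual scaling, not from the source

    def __call__(self, tokens, grid):
        # tokens: (H*W, D) patch tokens laid out row by row on an H x W grid.
        h, w = grid
        z = (tokens @ self.down).reshape(h, w, -1)
        z = np.maximum(depthwise_conv3x3(z, self.depthwise), 0.0)
        z = z.reshape(h * w, -1)
        return tokens + self.scale * (z @ self.up)


class TextAdapter:
    """Hypothetical text branch: dense linear bottleneck, no spatial mixing."""

    def __init__(self, dim, rng, reduction=4, scale=0.1):
        hidden = dim // reduction
        self.down = kaiming_normal(rng, dim, (dim, hidden))
        self.up = kaiming_normal(rng, hidden, (hidden, dim))
        self.scale = scale

    def __call__(self, tokens):
        # tokens: (T, D) text tokens.
        z = np.maximum(tokens @ self.down, 0.0)
        return tokens + self.scale * (z @ self.up)


rng = np.random.default_rng(0)
dim = 64
visual_tokens = rng.normal(size=(7 * 7, dim))  # 7x7 patch grid
text_tokens = rng.normal(size=(8, dim))

visual_out = VisualAdapter(dim, rng)(visual_tokens, grid=(7, 7))
text_out = TextAdapter(dim, rng)(text_tokens)
print(visual_out.shape, text_out.shape)
```

In this sketch both branches share the same bottleneck width and differ only in how tokens are mixed: the visual branch lets each patch see its spatial neighbours through the depthwise kernel, while the text branch transforms each token independently. Because the weights are not zero-initialized, the adapter output differs from its input from the first step, which is the behaviour the abstract describes as active gradient initialization.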
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Classification | Food101 | -- | -- | 457 |
| Image Classification | StanfordCars | -- | -- | 312 |
| Image Classification | Caltech101 | Base Accuracy | 98.41 | 148 |
| Image Classification | EuroSAT | Base Accuracy | 95.43 | 104 |
| Base-to-New Generalization | Avg over 11 datasets | Base Score | 84.29 | 90 |
| Action Recognition | UCF101 | Base Accuracy | 85.73 | 75 |
| Image Classification | Oxford Pets | Base Accuracy | 95.71 | 60 |
| Image Classification | ImageNet source to 10 fine-grained target datasets (test) | Caltech101 Accuracy | 94.81 | 30 |
| Image Classification | FGVC Aircraft | Base Accuracy | 42.38 | 25 |
| Image Classification | SUN397 | Base Accuracy | 81.9 | 25 |