HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

About

Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, where visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the modalities: spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: it processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: unlike standard expanding adapters, HeBA employs a compression bottleneck (D -> D/4) that forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: we challenge the restrictive zero-initialization paradigm, using a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone's pre-trained knowledge. Extensive experiments demonstrate that HeBA's architecturally specialized design achieves superior stability and accuracy, establishing a new state of the art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA.
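To make the D -> D/4 bottleneck and the Kaiming initialization concrete, the following is a rough parameter-count sketch in plain Python. It is not taken from the HeBA repository. The layer layout is an assumption: a text branch of two linear layers (D -> D/4 -> D), and a visual branch of one 3x3 depthwise-separable convolution (D -> D/4) followed by a linear up-projection. The 4x "expanding" baseline, the width D = 512, and all function names are likewise hypothetical and chosen only for illustration.

```python
import math


def linear_params(d_in, d_out):
    """Weights plus biases of a dense layer d_in -> d_out."""
    return d_in * d_out + d_out


def depthwise_separable_params(c_in, c_out, kernel=3):
    """Depthwise k x k conv (one filter per channel) plus a 1x1 pointwise conv."""
    depthwise = c_in * kernel * kernel + c_in
    pointwise = c_in * c_out + c_out
    return depthwise + pointwise


def text_branch_params(d, reduction=4):
    """Assumed text branch: linear down-projection D -> D/4, then linear up D/4 -> D."""
    h = d // reduction
    return linear_params(d, h) + linear_params(h, d)


def visual_branch_params(d, reduction=4, kernel=3):
    """Assumed visual branch: depthwise-separable conv D -> D/4, then linear up D/4 -> D."""
    h = d // reduction
    return depthwise_separable_params(d, h, kernel) + linear_params(h, d)


def expanding_adapter_params(d, expansion=4):
    """Hypothetical baseline: an adapter that widens D -> 4D before projecting back."""
    h = d * expansion
    return linear_params(d, h) + linear_params(h, d)


def kaiming_normal_std(fan_in):
    """Standard deviation of Kaiming (He) normal init with ReLU gain: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)


if __name__ == "__main__":
    d = 512  # illustrative token width, not a value stated in the abstract
    print(text_branch_params(d))        # 131712
    print(visual_branch_params(d))      # 136832
    print(expanding_adapter_params(d))  # 2099712
    print(kaiming_normal_std(d))        # 0.0625
```

Under these assumptions the bottleneck branches carry roughly one sixteenth of the parameters of the 4x expanding baseline. The nonzero Kaiming standard deviation is what separates this scheme from zero-initialization, under which the adapter output starts at exactly zero.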

Md Jahidul Islam • 2026

Related benchmarks

Task	Dataset	Result	(unlabeled)
Image Classification	Food101	--	457
Image Classification	StanfordCars	--	384
Image Classification	Caltech101	Base Accuracy 98.41	148
Image Classification	EuroSAT	Base Accuracy 95.43	104
Base-to-New Generalization	Avg over 11 datasets	Base Score 84.29	102
Action Recognition	UCF101	Base Accuracy 85.73	75
Image Classification	Oxford Pets	Base Accuracy 95.71	60
Image Classification	FGVC Aircraft	Base Accuracy 42.38	38
Image Classification	ImageNet source to 10 fine-grained target datasets (test)	Caltech101 Accuracy 94.81	37
Image Classification	SUN397	Base Accuracy 81.9	25

Showing 10 of 13 rows
