MAIL++: Multi-Modal Bi-directional Agent Layer for Vision-Language Models

About

Adapting large vision-language models (VLMs) such as CLIP to downstream tasks remains challenging, as full fine-tuning is computationally prohibitive and prone to overfitting in low-data regimes. Parameter-efficient fine-tuning (PEFT) alleviates these issues with lightweight prompt- or adapter-based modules, and cross-modal coupling has proven especially effective by strengthening interactions between vision and language. However, existing coupling mechanisms predominantly rely on external auxiliary modules, leading to indirect, coarse-grained interactions that are structurally decoupled from the original VLM and thus limit representational expressiveness. In this paper, we propose the Multi-Modal Interactive Agent Layer (MAIL), a PEFT paradigm that embeds cross-modal coupling directly into the intrinsic computation modules of VLMs. MAIL freezes the backbone and inserts lightweight agent layers after core modules, such as LayerNorm, to approximate the parameter updates induced by full fine-tuning. To couple visual and textual streams at this level, we introduce a bottleneck-based text-to-image bridge that jointly optimizes paired agent layers across modalities, coordinating the adaptation of corresponding computation modules. We further present MAIL++, which enables bidirectional cross-modal exchange through a meta agent layer, a meta-text bridge, and a meta-image bridge. At inference time, all agent layers are re-parameterized into the frozen backbone, preserving the original computational efficiency. Extensive experiments on few-shot image classification and few-shot universal cross-domain retrieval demonstrate that MAIL and MAIL++ consistently outperform state-of-the-art PEFT methods.
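The abstract's claim that agent layers can be re-parameterized into the frozen backbone can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes, purely for illustration, that an agent layer placed after LayerNorm is a per-channel affine map, and the names `agent_layer`, `merge_agent_into_layer_norm`, `scale`, and `shift` are hypothetical. Under that assumption the agent layer folds exactly into the LayerNorm weight and bias, so inference runs a single LayerNorm with no extra cost.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard LayerNorm over the last axis with affine parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta


def agent_layer(h, scale, shift):
    """Hypothetical agent layer: a trainable per-channel affine map.

    ASSUMPTION: the abstract does not specify the agent layer's form;
    an affine map is used here only because it merges in closed form.
    """
    return h * scale + shift


def merge_agent_into_layer_norm(gamma, beta, scale, shift):
    """Fold the affine agent layer into the LayerNorm parameters.

    (n * gamma + beta) * scale + shift
        == n * (gamma * scale) + (beta * scale + shift)
    """
    return gamma * scale, beta * scale + shift


rng = np.random.default_rng(0)
dim = 8
x = rng.normal(size=(4, dim))

# Frozen backbone parameters (never updated during adaptation).
gamma = rng.normal(size=dim)
beta = rng.normal(size=dim)

# Stand-ins for agent-layer parameters learned during fine-tuning.
scale = 1.0 + 0.1 * rng.normal(size=dim)
shift = 0.1 * rng.normal(size=dim)

# Training-time path: frozen LayerNorm followed by the agent layer.
train_out = agent_layer(layer_norm(x, gamma, beta), scale, shift)

# Inference-time path: one LayerNorm with merged parameters.
merged_gamma, merged_beta = merge_agent_into_layer_norm(gamma, beta, scale, shift)
infer_out = layer_norm(x, merged_gamma, merged_beta)

max_abs_diff = float(np.max(np.abs(train_out - infer_out)))
print("max abs difference:", max_abs_diff)
```

The two paths agree up to floating-point rounding, which is the property that lets the adapted model keep the backbone's original inference cost. The cross-modal bridges described in the abstract are not modeled here.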

Kaixiang Chen, Pengfei Fang, Hui Xue • 2026

Related benchmarks

Task	Dataset	Metric	Result
Image Classification	StanfordCars	Accuracy	66.7	384
Image Classification	OxfordPets	Accuracy	91.37	298
Image Classification	FGVCAircraft	Accuracy	26.87	289
Image Classification	OxfordPets	H Score	97.14	182
Image Classification	Food101	Accuracy	86.63	177
Image Classification	UCF101	Base Classes Acc	87.9	139
Image Classification	DTD	Accuracy	46.37	87
Image Classification	Caltech101	Top-1 Accuracy (Caltech101)	94.73	84
Image Classification	Food101	Base Accuracy	91.1	69
Image Classification	Caltech101	Base Accuracy	98.8	68

Showing 10 of 32 rows
