
MAIL++: Multi-Modal Bi-directional Agent Layer for Vision-Language Models

About

Adapting large vision-language models (VLMs) such as CLIP to downstream tasks remains challenging, as full fine-tuning is computationally prohibitive and prone to overfitting in low-data regimes. Parameter-efficient fine-tuning (PEFT) alleviates these issues with lightweight prompt- or adapter-based modules, and cross-modal coupling has proven especially effective by strengthening interactions between vision and language. However, existing coupling mechanisms predominantly rely on external auxiliary modules, leading to indirect, coarse-grained interactions that are structurally decoupled from the original VLM and thus limit representational expressiveness. In this paper, we propose Multi-Modal Interactive Agent Layer (MAIL), a PEFT paradigm that embeds cross-modal coupling directly into the intrinsic computation modules of VLMs. MAIL freezes the backbone and inserts lightweight agent layers after core modules, such as LayerNorm, to approximate the parameter updates induced by full fine-tuning. To couple visual and textual streams at this level, we introduce a bottleneck-based text-to-image bridge that jointly optimizes paired agent layers across modalities, coordinating the adaptation of corresponding computation modules. We further present MAIL++, which enables bidirectional cross-modal exchange through a meta agent layer, a meta-text bridge, and a meta-image bridge. At inference time, all agent layers are re-parameterized into the frozen backbone, preserving the original computational efficiency. Extensive experiments on few-shot image classification and few-shot universal cross-domain retrieval demonstrate that MAIL and MAIL++ consistently outperform state-of-the-art PEFT methods.
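The re-parameterization step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the form of the agent layer, so the code below assumes, purely for illustration, an elementwise affine agent layer (a per-channel scale and shift) placed after a frozen LayerNorm. Under that assumption the agent layer folds exactly into the LayerNorm weight and bias, so inference costs the same as the original module. The names `agent_layer` and `merge_agent_into_layer_norm` are hypothetical and not taken from the paper, and the cross-modal bridges are not modeled here.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Plain LayerNorm over a single feature vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


def agent_layer(h, scale, shift):
    """Hypothetical agent layer (an assumption): elementwise affine map."""
    return [s * v + t for v, s, t in zip(h, scale, shift)]


def merge_agent_into_layer_norm(gamma, beta, scale, shift):
    """Fold the affine agent layer into the LayerNorm parameters.

    scale * (gamma * n + beta) + shift
        == (scale * gamma) * n + (scale * beta + shift)
    """
    merged_gamma = [s * g for g, s in zip(gamma, scale)]
    merged_beta = [s * b + t for b, s, t in zip(beta, scale, shift)]
    return merged_gamma, merged_beta


# Frozen backbone parameters (illustrative values only).
gamma = [1.0, 0.5, 2.0, 1.5]
beta = [0.0, 0.1, -0.2, 0.3]
# Trainable agent parameters (illustrative values only).
scale = [1.1, 0.9, 1.0, 1.2]
shift = [0.05, 0.0, -0.1, 0.02]

x = [0.3, -1.2, 2.5, 0.7]

# Training-time path: frozen LayerNorm followed by the agent layer.
train_out = agent_layer(layer_norm(x, gamma, beta), scale, shift)

# Inference-time path: a single LayerNorm with merged parameters.
merged_gamma, merged_beta = merge_agent_into_layer_norm(gamma, beta, scale, shift)
merged_out = layer_norm(x, merged_gamma, merged_beta)

max_diff = max(abs(a - b) for a, b in zip(train_out, merged_out))
```

Here `max_diff` should be zero up to floating-point rounding, which is what lets the extra layer disappear at inference. A non-affine agent layer would not merge this way, so the sketch covers only the affine case.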

Kaixiang Chen, Pengfei Fang, Hui Xue • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Image Classification | StanfordCars | Accuracy | 66.7 | 384 |
| Image Classification | OxfordPets | Accuracy | 91.37 | 298 |
| Image Classification | FGVCAircraft | Accuracy | 26.87 | 289 |
| Image Classification | OxfordPets | H Score | 97.14 | 182 |
| Image Classification | Food101 | Accuracy | 86.63 | 177 |
| Image Classification | UCF101 | Base Classes Acc | 87.9 | 139 |
| Image Classification | Caltech101 | Top-1 Accuracy (Caltech101) | 94.73 | 84 |
| Image Classification | DTD | Accuracy | 46.37 | 75 |
| Image Classification | Food101 | Base Accuracy | 91.1 | 69 |
| Image Classification | Caltech101 | Base Accuracy | 98.8 | 68 |
Showing 10 of 32 rows
