
CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks

About

Foundation models have revolutionized AI, but adapting them efficiently to multimodal tasks, particularly in dual-stream architectures built from unimodal encoders such as DINO and BERT, remains a significant challenge. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) enable lightweight adaptation, yet they operate in isolation within each modality, limiting their ability to capture cross-modal interactions. In this paper, we take a step toward bridging this gap with Cross-Modal Low-Rank Adaptation (CoLA), a novel PEFT framework that extends LoRA by introducing a dedicated inter-modal adaptation pathway alongside the standard intra-modal one. This dual-path design enables CoLA to adapt unimodal foundation models to multimodal tasks effectively, without interference between modality-specific and cross-modal learning. We evaluate CoLA on vision-language (RefCOCO, RefCOCO+, RefCOCOg) and audio-visual (AVE, AVS) benchmarks, where it consistently outperforms LoRA, achieving relative gains of around 3% and 2%, respectively, while maintaining parameter efficiency. Notably, CoLA enables the first multi-task PEFT framework for visual grounding, bridging a key gap in efficient multimodal adaptation.
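The abstract describes a dual-path design: a standard intra-modal LoRA update plus a separate inter-modal low-rank pathway fed by the other modality's features. The paper's exact formulation is not given here, so the following is a minimal numpy sketch of that idea under assumed conventions (zero-initialized up-projections, one adapted linear layer per encoder); all names (`cola_layer`, `A_intra`, `B_inter`, etc.) are illustrative, not the authors' API.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # hidden size and low rank (illustrative values)

# Frozen pretrained weight of one unimodal encoder layer.
W = rng.standard_normal((d, d)) * 0.02

# Intra-modal low-rank pair (standard LoRA: update = B @ A).
A_intra = rng.standard_normal((r, d)) * 0.02
B_intra = np.zeros((d, r))  # zero-init so training starts from the frozen model

# Inter-modal low-rank pair: reads features from the *other* modality.
A_inter = rng.standard_normal((r, d)) * 0.02
B_inter = np.zeros((d, r))

def cola_layer(x_self, x_other):
    """Frozen projection plus intra-modal and inter-modal low-rank updates."""
    return (x_self @ W.T
            + x_self @ (B_intra @ A_intra).T    # intra-modal path (LoRA)
            + x_other @ (B_inter @ A_inter).T)  # inter-modal path (CoLA's addition)

x_vision = rng.standard_normal((1, d))  # e.g. a DINO token feature
x_text = rng.standard_normal((1, d))    # e.g. a BERT token feature
out = cola_layer(x_vision, x_text)

# With zero-initialized B matrices, the layer reduces to the frozen projection,
# so adaptation starts from the pretrained behavior.
assert np.allclose(out, x_vision @ W.T)
```

Only the four low-rank matrices would be trained, which is what keeps the scheme parameter-efficient: 4·d·r parameters per adapted layer instead of d².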

Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Ahmed • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy 79.6 | 354 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy 89.4 | 344 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy 0.91 | 342 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy 81.8 | 300 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy 81.7 | 300 |
| Referring Expression Segmentation | RefCOCO (testA) | -- | 257 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy 71.9 | 244 |
| Referring Expression Segmentation | RefCOCO+ (testA) | -- | 230 |
| Referring Expression Segmentation | RefCOCO+ (val) | -- | 223 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy 84.7 | 216 |

(Showing 10 of 20 rows.)
