
HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding

About

Visual grounding, which aims to ground a visual region via natural language, is a task that heavily relies on cross-modal alignment. Existing works utilize uni-modal pre-trained models to transfer visual or linguistic knowledge separately while ignoring the multimodal corresponding information. Motivated by recent advancements in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, there exist significant task gaps between pre-training and grounding. To address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (HiLoRA) paradigm. The cross-modal bridge addresses the inconsistency between the pre-trained visual features and those required for grounding, and establishes a connection between multi-level visual and text features. HiLoRA prevents the accumulation of perceptual errors by adapting the cross-modal features from shallow to deep layers in a hierarchical manner. Experimental results on five datasets demonstrate the effectiveness of our approach and showcase significant grounding capabilities as well as promising energy-efficiency advantages. Project page: https://github.com/linhuixiao/HiVG.
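As background, the low-rank adaptation idea the abstract builds on can be sketched as follows. This is a generic illustrative LoRA layer, not the authors' HiVG/HiLoRA implementation (the class name and hyperparameters here are assumptions): a frozen pre-trained linear layer is augmented with a trainable low-rank update B·A, so only a small number of parameters are tuned for the downstream grounding task.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: y = W x + (alpha / r) * B A x,
    where W is frozen and only the low-rank factors A, B are trained."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        # A is small-random, B is zero, so training starts from the
        # unmodified pre-trained mapping.
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```

HiLoRA, as described above, goes beyond this vanilla form by applying such adapters hierarchically, adapting shallow layers before deeper ones so that perceptual errors do not accumulate across the backbone; the exact scheduling is detailed in the paper.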

Linhui Xiao, Xiaoshan Yang, Fang Peng, Yaowei Wang, Changsheng Xu · 2024

Related benchmarks

Task: Referring Expression Comprehension (all rows)

Dataset                                    Metric            Result   Rank
RefCOCO+ (val)                             Accuracy          78.1     354
RefCOCO (val)                              Accuracy          87.3     344
RefCOCO (testA)                            Accuracy          89.9     342
RefCOCOg (test)                            Accuracy          78.8     300
RefCOCOg (val)                             Accuracy          78.3     300
RefCOCO+ (testB)                           Accuracy          68.1     244
RefCOCO+ (testA)                           Accuracy          83.8     216
RefCOCO (testB)                            Accuracy          83.3     205
RefCOCO v1 (val)                           Top-1 Accuracy    88.14    49
RefCOCO, RefCOCO+, and RefCOCOg Average    Average Accuracy  80.9     44

Showing 10 of 19 rows.
