HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding

About

Visual grounding, which aims to ground a visual region via natural language, is a task that heavily relies on cross-modal alignment. Existing works utilize uni-modal pre-trained models to transfer visual or linguistic knowledge separately while ignoring multimodal correspondence information. Motivated by recent advancements in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, there exist significant task gaps between pre-training and grounding. To address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (HiLoRA) paradigm. The cross-modal bridge addresses the inconsistency between the visual features produced by pre-training and those required for grounding, and establishes a connection between multi-level visual and text features. HiLoRA prevents the accumulation of perceptual errors by adapting the cross-modal features from shallow to deep layers in a hierarchical manner. Experimental results on five datasets demonstrate the effectiveness of our approach and showcase significant grounding capabilities as well as promising energy efficiency advantages. The project page: https://github.com/linhuixiao/HiVG.
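
The abstract describes HiLoRA only at a high level. The sketch below illustrates the two generic ideas it names: a low-rank update added to a frozen weight, and adapters switched on from shallow to deep layers in stages. It is not the authors' implementation. The names `LoRALinear`, `stage_schedule`, and `apply_stage`, the rank and scaling values, and the equal-sized stage split are all assumptions made for illustration.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LoRALinear:
    """Frozen weight W plus an optional low-rank update scale * B @ A (generic LoRA)."""

    def __init__(self, d_in, d_out, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        # W stands in for a pre-trained weight and is never updated.
        self.W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
        # A gets a small random init; B starts at zero, so the adapter
        # initially leaves the frozen layer's output unchanged.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank
        self.enabled = False

    def __call__(self, x):
        y = matvec(self.W, x)
        if self.enabled:
            delta = matvec(self.B, matvec(self.A, x))
            y = [yi + self.scale * di for yi, di in zip(y, delta)]
        return y


def stage_schedule(num_layers, num_stages):
    """Hypothetical schedule: stage s activates adapters on the first
    ceil((s + 1) * num_layers / num_stages) layers, shallow to deep."""
    bounds = [
        ((s + 1) * num_layers + num_stages - 1) // num_stages
        for s in range(num_stages)
    ]
    return [list(range(b)) for b in bounds]


def apply_stage(layers, active):
    """Enable adapters only on the layer indices listed in `active`."""
    active = set(active)
    for i, layer in enumerate(layers):
        layer.enabled = i in active


if __name__ == "__main__":
    layers = [LoRALinear(4, 4, seed=i) for i in range(12)]
    for stage, active in enumerate(stage_schedule(len(layers), 3)):
        apply_stage(layers, active)
        print(f"stage {stage}: adapters enabled on layers {active}")
```

Under these assumptions, each stage widens the set of adapted layers while the pre-trained weights stay frozen. How HiVG actually groups layers and which modules it adapts is specified in the paper and the project repository, not here.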

Linhui Xiao, Xiaoshan Yang, Fang Peng, Yaowei Wang, Changsheng Xu • 2024

Related benchmarks

Task	Dataset	Metric	Value
Referring Expression Comprehension	RefCOCO+ (val)	Accuracy	78.1	354
Referring Expression Comprehension	RefCOCO (val)	Accuracy	87.3	348
Referring Expression Comprehension	RefCOCO (testA)	Accuracy	0.899	346
Referring Expression Comprehension	RefCOCOg (test)	Accuracy	78.8	300
Referring Expression Comprehension	RefCOCOg (val)	Accuracy	78.3	300
Referring Expression Comprehension	RefCOCO+ (testB)	Accuracy	68.1	244
Referring Expression Comprehension	RefCOCO+ (testA)	Accuracy	83.8	216
Referring Expression Comprehension	RefCOCO (testB)	Accuracy	83.3	213
Referring Expression Comprehension	RefCOCO v1 (val)	Top-1 Accuracy	88.14	49
Referring Expression Comprehension	RefCOCO, RefCOCO+, and RefCOCOg Average	Average Accuracy	80.9	44

Showing 10 of 19 rows
