HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding
About
Visual grounding, which aims to ground a visual region via natural language, is a task that heavily relies on cross-modal alignment. Existing works use uni-modal pre-trained models to transfer visual or linguistic knowledge separately, ignoring the correspondence information between the modalities. Motivated by recent advances in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, there exist significant task gaps between pre-training and grounding. To address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (HiLoRA) paradigm. The cross-modal bridge addresses the inconsistency between the visual features produced by pre-training and those required for grounding, and establishes a connection between multi-level visual and text features. HiLoRA prevents the accumulation of perceptual errors by adapting the cross-modal features from shallow to deep layers in a hierarchical manner. Experimental results on five datasets demonstrate the effectiveness of our approach and showcase its significant grounding capabilities as well as promising energy efficiency advantages. The project page: https://github.com/linhuixiao/HiVG.
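For readers unfamiliar with low-rank adaptation, the following is a minimal sketch of a generic LoRA forward pass, in which a frozen weight matrix `W` is augmented by a trainable low-rank product `B A`. It is not the HiLoRA procedure described above, whose hierarchical, shallow-to-deep schedule is not specified in this abstract. The function names, the `alpha` scaling parameter, and the toy matrices are assumptions made for illustration only.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha=1.0):
    """Generic LoRA forward pass: y = W x + (alpha / r) * B (A x).

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    alpha: hypothetical scaling hyperparameter (an assumption here)
    """
    r = len(A)                            # rank of the low-rank update
    base = matvec(W, x)                   # output of the frozen layer
    update = matvec(B, matvec(A, x))      # low-rank correction B (A x)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy example: with B initialised to zeros, the adapted layer
# reproduces the frozen layer exactly.
W = [[1, 0], [0, 1]]
A = [[1, 1]]
x = [2, 3]
y_init = lora_forward(W, A, [[0], [0]], x)
y_tuned = lora_forward(W, A, [[1], [2]], x)
```

Because only `A` and `B` would be trained, the number of updated parameters scales with the rank `r` rather than with the full size of `W`, which is the general motivation for low-rank adaptation.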
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO v1 (val) | Top-1 Accuracy | 88.14 | 49 |
| Visual Grounding | RefFLIR 1.0 (val) | Accuracy @ 0.5 IoU | 69.08 | 29 |
| Visual Grounding | RefFLIR RGBT-Ground (val) | Acc@0.5 | 0.7533 | 10 |
| Visual Grounding | RefMFAD RGBT-Ground (test) | Accuracy @ 0.5 IoU | 67.04 | 10 |
| Visual Grounding | RefFLIR RGBT-Ground (test) | Accuracy @ 0.5 IoU | 72.5 | 10 |
| Visual Grounding | RefM3FD RGBT-Ground (val) | Acc@0.5 | 69.64 | 10 |
| Visual Grounding | RefM3FD RGBT-Ground (test) | Accuracy @ 0.5 | 72.35 | 10 |
| Visual Grounding | RefMFAD RGBT-Ground (val) | Acc@0.5 | 0.6707 | 10 |
| Visual Grounding | RefMFAD 1.0 (testC) | Acc@0.5 | 45.69 | 3 |
| Visual Grounding | RefM3FD 1.0 (test) | Accuracy@0.5 | 53.1 | 3 |
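
Most rows above report accuracy at an IoU threshold of 0.5: a prediction counts as correct when the intersection-over-union between the predicted and ground-truth boxes reaches the threshold. The sketch below shows one common way to compute this metric. The `(x1, y1, x2, y2)` box format, the function names, and the use of `>=` rather than a strict `>` at the threshold are assumptions and may differ from the evaluation code used for these benchmarks.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the target is >= threshold."""
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


# Toy example: one exact match and one complete miss.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 10), (20, 20, 30, 30)]
acc = accuracy_at_iou(preds, targets)
```

The function returns a fraction in [0, 1]; multiplying by 100 gives the percentage form. The table mixes both scales (for example 0.7533 and 72.5), so values should be compared with care.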