Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels
About
This work presents a simple yet effective workflow for automatically scaling instruction-following data to elicit pixel-level grounding capabilities of VLMs under complex instructions. In particular, we address five critical real-world challenges in text-instruction-based grounding: hallucinated references, multi-object scenarios, reasoning, multi-granularity, and part-level references. By leveraging knowledge distillation from a pre-trained teacher model, our approach generates high-quality instruction-response pairs linked to existing pixel-level annotations, minimizing the need for costly human annotation. The resulting dataset, Ground-V, captures rich object localization knowledge and nuanced pixel-level referring expressions. Experimental results show that models trained on Ground-V exhibit substantial improvements across diverse grounding tasks. Specifically, incorporating Ground-V during training yields an average gIoU improvement of 4.4% for LISA and 7.9% for PSALM across six benchmarks. It also sets new state-of-the-art results on standard benchmarks such as RefCOCO/+/g. Notably, on gRefCOCO, we achieve an N-Acc of 83.3%, exceeding the previous state of the art by more than 20%.
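For readers unfamiliar with the metrics named above, the following is a minimal sketch of how gIoU, cIoU, and N-Acc are commonly computed in the referring segmentation literature. It is not taken from the Ground-V code; the function names, the flattened-list mask representation, and the convention of scoring an empty prediction on an empty target as IoU 1 are assumptions made for illustration.

```python
from typing import Sequence, Tuple

Mask = Sequence[int]  # flattened binary mask, 1 = foreground (assumed representation)


def inter_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersecting and united foreground pixels of two masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def giou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Mean of per-sample IoU values."""
    ious = []
    for pred, gt in zip(preds, gts):
        inter, union = inter_union(pred, gt)
        # Assumption: an empty prediction on an empty target scores 1.0.
        ious.append(1.0 if union == 0 else inter / union)
    return sum(ious) / len(ious)


def ciou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Cumulative intersection divided by cumulative union over all samples."""
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = inter_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


def n_acc(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Fraction of no-target samples for which the prediction is also empty."""
    no_target = [pred for pred, gt in zip(preds, gts) if not any(gt)]
    if not no_target:
        return float("nan")
    return sum(1 for pred in no_target if not any(pred)) / len(no_target)


# Toy data: one partially correct mask, one correct no-target, one hallucinated mask.
preds = [[1, 1, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0]]
gts = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

print(giou(preds, gts), ciou(preds, gts), n_acc(preds, gts))
```

In this toy case the hallucinated mask on the third sample lowers all three scores, which is the failure mode that N-Acc is meant to isolate on no-target expressions.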
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Referring Image Segmentation | RefCOCO (val) | -- | 259 |
| Referring Image Segmentation | RefCOCO+ (test-B) | -- | 252 |
| Referring Image Segmentation | RefCOCO (test A) | -- | 230 |
| Referring Image Segmentation | RefCOCO+ (val) | -- | 179 |
| Referring Image Segmentation | RefCOCO (test-B) | -- | 171 |
| Generalized Referring Expression Segmentation | gRefCOCO (testA) | cIoU 75.2 | 139 |
| Generalized Referring Expression Segmentation | gRefCOCO (val) | cIoU 68 | 123 |
| Generalized Referring Expression Segmentation | gRefCOCO (testB) | cIoU 73.1 | 121 |
| Referring Image Segmentation | RefCOCOg (val) | -- | 100 |
| Referring Image Segmentation | RefCOCO+ (testA) | -- | 97 |