
RegionGPT: Towards Region Understanding Vision Language Model

About

Vision language models (VLMs) have advanced rapidly through the integration of large language models (LLMs) with image-text pairs, yet they struggle with detailed regional visual understanding for two reasons: the vision encoder has limited spatial awareness, and the coarse-grained training data lacks detailed, region-specific captions. To address this, we introduce RegionGPT (RGPT for short), a novel framework designed for complex region-level captioning and understanding. RGPT enhances the spatial awareness of regional representations with simple yet effective modifications to existing visual encoders in VLMs. We further improve performance on tasks requiring a specific output scope by integrating task-guided instruction prompts during both training and inference, while maintaining the model's versatility for general-purpose tasks. Additionally, we develop an automated region caption data generation pipeline that enriches the training set with detailed region-level captions. We demonstrate that a universal RGPT model can be effectively applied to a range of region-level tasks and significantly enhances performance on them, including but not limited to complex region description, reasoning, object classification, and referring expression comprehension.

Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, Sifei Liu • 2024

Related benchmarks

Task                               | Dataset                                      | Result                     | Rank
Referring Expression Comprehension | RefCOCOg (test)                              | Accuracy 86.96             | 291
Referring Expression Comprehension | RefCOCOg (val)                               | Accuracy 86.44             | 291
Object Hallucination               | POPE (Random)                                | F1 Score 86.85             | 200
Object Hallucination               | POPE Adversarial                             | Accuracy 85.67             | 196
Object Hallucination               | POPE Popular                                 | F1 Score 85.92             | 188
Object Classification              | COCO 2017 (val)                              | Accuracy 80.61             | 23
Region-level captioning            | RefCOCOg                                     | METEOR 16.9                | 21
Region-level captioning            | RefCOCOg (test)                              | CIDEr 109.9                | 18
Spatial Reasoning                  | SpatialRGPT-Bench qualitative 1.0 (val test) | Below/Above Accuracy 30.83 | 11
Region Captioning                  | DLC-Bench                                    | Pos. Score 10.6            | 10
Showing 10 of 13 rows
