
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

About

We present Set-of-Mark (SoM), a new visual prompting method that unleashes the visual grounding abilities of large multimodal models (LMMs) such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, or boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao • 2023
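The marking pipeline described in the abstract (segment the image into regions, then overlay a distinct mark on each region before sending the image to the LMM) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes segmentation masks are already available (in practice they would come from SEEM or SAM), and it draws a simple filled square as a stand-in for the alphanumeric marks the real SoM toolkit renders with a text renderer.

```python
import numpy as np

def overlay_marks(image, masks):
    """Overlay a numbered mark at the centroid of each region mask.

    image: 2-D uint8 grayscale array.
    masks: list of boolean arrays, one per region (e.g., from SEEM/SAM).
    Returns (marked image copy, list of (label, (x, y)) annotations);
    the labels are what the LMM is later asked to refer to.
    """
    marked = image.copy()
    annotations = []
    for i, mask in enumerate(masks, start=1):
        ys, xs = np.nonzero(mask)
        cy, cx = int(ys.mean()), int(xs.mean())  # region centroid
        # Placeholder mark: a small bright square at the centroid.
        marked[max(cy - 2, 0):cy + 3, max(cx - 2, 0):cx + 3] = 255
        annotations.append((str(i), (cx, cy)))
    return marked, annotations

# Toy example: two rectangular "regions" in a 32x32 image.
img = np.zeros((32, 32), dtype=np.uint8)
m1 = np.zeros_like(img, dtype=bool); m1[4:12, 4:12] = True
m2 = np.zeros_like(img, dtype=bool); m2[20:28, 16:30] = True
marked, anns = overlay_marks(img, [m1, m2])
print(anns)  # one (label, centroid) pair per region
```

The marked image would then be passed to the LMM together with a prompt that refers to regions by their labels ("What is the object at mark 2?"), which is what lets the model ground its answers to specific regions.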

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | GQA | Accuracy | 62 | 505 |
| GUI Agent Task | AndroidWorld | Success Rate | 25.4 | 136 |
| Computer Use | OSWorld | OS Success Rate | 20.83 | 42 |
| GUI Navigation and Action | OSWorld (test) | Success Rate (OS) | 20.83 | 26 |
| Visual Question Answering | VQA v1 | Accuracy | 72.8 | 25 |
| Referring Expression Comprehension | RefCOCOg | Accuracy | 55.5 | 21 |
| Allocentric Spatial Reasoning | COMFORT# | Left/Right Accuracy | 46.58 | 19 |
| Allocentric Spatial Reasoning | 3DSRBench | Left/Right Accuracy | 37.54 | 19 |
| Grounded Task Planning | GroundedPlanBench (Explicit Instructions, Short Horizon) | TSR | 14.7 | 15 |
| Grounded Task Planning | GroundedPlanBench (Explicit Instructions, Medium Horizon) | TSR | 2 | 15 |

Showing 10 of 21 rows.
