Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
About
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs) such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, and boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.
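The core mechanic above (partition the image, then anchor an alphanumeric mark on each region before querying the LMM) can be sketched in a few lines. This is a minimal illustration, not the SoM implementation: the segmenter (SEEM/SAM) is stood in for by synthetic boolean masks, and marks are simply placed at each region's centroid.

```python
import numpy as np

def mark_locations(masks):
    """Assign an alphanumeric mark to each segmentation mask and
    compute where to draw it (the region centroid).

    masks: list of 2-D boolean arrays, one region each, as an
    off-the-shelf segmenter such as SEEM/SAM would produce.
    Returns a list of (label, (row, col)) pairs.
    """
    marks = []
    for i, m in enumerate(masks, start=1):
        rows, cols = np.nonzero(m)          # pixels belonging to the region
        marks.append((str(i), (int(rows.mean()), int(cols.mean()))))
    return marks

# Two synthetic rectangular regions standing in for segmenter output.
h, w = 8, 8
m1 = np.zeros((h, w), dtype=bool); m1[0:4, 0:4] = True
m2 = np.zeros((h, w), dtype=bool); m2[4:8, 4:8] = True
print(mark_locations([m1, m2]))  # one mark per region, at its centroid
```

The marked image (regions plus rendered labels) is then sent to GPT-4V, whose answer can refer to regions by label, e.g., "the object marked 2".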
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | GQA | Accuracy | 62 | 505 |
| GUI Agent Task | AndroidWorld | Success Rate | 25.4 | 136 |
| Computer Use | OSWorld | OS Success Rate | 20.83 | 42 |
| GUI Navigation and Action | OSWorld (test) | Success Rate (OS) | 20.83 | 26 |
| Visual Question Answering | VQA v1 | Accuracy | 72.8 | 25 |
| Referring Expression Comprehension | RefCOCOg | Accuracy | 55.5 | 21 |
| Allocentric Spatial Reasoning | COMFORT# | Left/Right Accuracy | 46.58 | 19 |
| Allocentric Spatial Reasoning | 3DSRBench | Left/Right Accuracy | 37.54 | 19 |
| Grounded Task Planning | GroundedPlanBench (Explicit Instructions, Short Horizon) | TSR | 14.7 | 15 |
| Grounded Task Planning | GroundedPlanBench (Explicit Instructions, Medium Horizon) | TSR | 2 | 15 |