
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

About

We present Set-of-Mark (SoM), a new visual prompting method that unleashes the visual grounding abilities of large multimodal models (LMMs) such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, and boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is public at: https://github.com/microsoft/SoM.
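The core of SoM prompting is mechanical: take region masks from a segmentation model and stamp a distinct mark on each region before sending the image to the LMM. The sketch below shows the mark-overlay step only, assuming the masks have already been produced (e.g., by SAM/SEEM); the function name `overlay_marks` and the synthetic rectangular masks are illustrative, not from the SoM codebase.

```python
import numpy as np
from PIL import Image, ImageDraw

def overlay_marks(image, masks):
    """Stamp a numeric mark at the centroid of each region mask.

    `masks` is a list of boolean (H, W) arrays, e.g. produced by an
    off-the-shelf segmenter such as SAM/SEEM (assumed, not shown here).
    Returns a new marked image; the marked image is what gets sent to
    the LMM so it can refer to regions by their numbers.
    """
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, mask in enumerate(masks, start=1):
        ys, xs = np.nonzero(mask)          # pixel coordinates of the region
        cx, cy = int(xs.mean()), int(ys.mean())  # region centroid
        # small white patch behind the label keeps the mark readable
        draw.rectangle([cx - 6, cy - 6, cx + 6, cy + 6], fill="white")
        draw.text((cx - 3, cy - 6), str(idx), fill="black")
    return marked

# Synthetic example: two rectangular "regions" in a 64x64 gray image
img = Image.new("RGB", (64, 64), "gray")
m1 = np.zeros((64, 64), dtype=bool); m1[8:24, 8:24] = True
m2 = np.zeros((64, 64), dtype=bool); m2[36:60, 30:58] = True
marked = overlay_marks(img, [m1, m2])
```

A text question that mentions the marks (e.g., "what is the object labeled 2?") can then be paired with the marked image in the LMM prompt, which is what lets the model ground its answers to specific regions.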

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao• 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| GUI Agent Task | AndroidWorld | Success Rate | 25.4 | 104 |
| GUI Navigation and Action | OS World (test) | Success Rate (OS) | 20.83 | 26 |
| Computer Use | OSWorld | OS Success Rate | 20.83 | 22 |
| Allocentric Spatial Reasoning | COMFORT# | Left/Right Accuracy | 46.58 | 19 |
| Allocentric Spatial Reasoning | 3DSRBench | Left/Right Acc | 37.54 | 19 |
| Online Web Agent Task | MobileMiniWob++ | Task Success Rate | 67.7 | 12 |
| Element Grounding | Multimodal-Mind2Web (out-of-distribution) | Cross-Task Generalization | 29.6 | 10 |
| UI Control | AndroidControl | Success Rate | 45 | 2 |