Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

About

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, and boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.
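The marking step described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the segmentation model has already produced one binary mask per region, and it places each numeric mark at the region pixel nearest the region's centroid, which is a simplification chosen for this example. The names `mark_position`, `assign_marks`, and `build_prompt` are hypothetical helpers, and the prompt wording is illustrative only.

```python
from typing import Dict, List, Tuple

# Assumption: each region arrives as a binary mask (1 = pixel in region),
# e.g. exported from a segmentation model such as SAM or SEEM.
Mask = List[List[int]]


def mask_pixels(mask: Mask) -> List[Tuple[int, int]]:
    """Return the (row, col) coordinates of every pixel set in the mask."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]


def mark_position(mask: Mask) -> Tuple[int, int]:
    """Pick the region pixel closest to the region's centroid.

    Snapping to a pixel inside the region keeps the mark on the region
    even when the centroid itself falls outside it (e.g. a ring shape).
    """
    pixels = mask_pixels(mask)
    if not pixels:
        raise ValueError("empty mask")
    centroid_r = sum(r for r, _ in pixels) / len(pixels)
    centroid_c = sum(c for _, c in pixels) / len(pixels)
    return min(
        pixels,
        key=lambda p: (p[0] - centroid_r) ** 2 + (p[1] - centroid_c) ** 2,
    )


def assign_marks(masks: List[Mask]) -> Dict[int, Tuple[int, int]]:
    """Map numeric marks 1..N to one (row, col) anchor per region."""
    return {i: mark_position(m) for i, m in enumerate(masks, start=1)}


def build_prompt(question: str, marks: Dict[int, Tuple[int, int]]) -> str:
    """Compose a text prompt that refers to the marks drawn on the image."""
    ids = ", ".join(str(i) for i in sorted(marks))
    return (
        f"The image is annotated with numeric marks {ids}. "
        f"{question} Answer with the mark number(s)."
    )


if __name__ == "__main__":
    # Two toy regions on a 5x5 image: a 3x3 block and a single pixel.
    block = [[1 if r < 3 and c < 3 else 0 for c in range(5)] for r in range(5)]
    dot = [[1 if (r, c) == (4, 4) else 0 for c in range(5)] for r in range(5)]
    marks = assign_marks([block, dot])
    print(marks)
    print(build_prompt("Which region is the largest?", marks))
```

In a full pipeline, the anchors returned by `assign_marks` would be used to draw the numbers onto the image before it is sent to the model together with the text prompt.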

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | -- | 2019 |
| Visual Question Answering | GQA | Accuracy: 54.5 | 1425 |
| Visual Question Answering | GQA | Accuracy: 62 | 524 |
| Visual Question Answering | TextVQA | TextVQA Accuracy: 61.5 | 210 |
| Document Visual Question Answering | DocVQA | Accuracy: 57.4 | 203 |
| GUI Agent Task | AndroidWorld | Success Rate: 25.4 | 188 |
| Visual Mathematical Reasoning | MathVista (testmini) | Accuracy: 67.2 | 88 |
| Multi-modal Question Answering | MMMU | Accuracy: 45.1 | 83 |
| Computer Use | OSWorld | OS Success Rate: 20.83 | 45 |
| GUI Navigation and Action | OS World (test) | Success Rate (Avg): 4.59 | 41 |

Showing 10 of 29 rows
