Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

About

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, and boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.
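The marking step described above can be illustrated with a small sketch. This is not the code from the SoM repository: it assumes regions arrive as boolean grids (a hypothetical `Mask` format standing in for segmentation model output), places each numeric mark at the region pixel nearest the region's centroid (a simplification chosen for illustration), and renders an ASCII grid in place of a real marked image. The names `mark_position`, `assign_marks`, and `render` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical region format: one boolean grid per segmented region.
Mask = List[List[bool]]


def mark_position(mask: Mask) -> Tuple[int, int]:
    """Return the (row, col) of the mask pixel closest to the mask's centroid."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, on in enumerate(row) if on]
    if not pixels:
        raise ValueError("empty mask")
    centre_r = sum(r for r, _ in pixels) / len(pixels)
    centre_c = sum(c for _, c in pixels) / len(pixels)
    # Snap to a real mask pixel so the mark never lands outside a concave region.
    return min(pixels, key=lambda p: (p[0] - centre_r) ** 2 + (p[1] - centre_c) ** 2)


def assign_marks(masks: List[Mask]) -> List[Tuple[int, Tuple[int, int]]]:
    """Give each region a numeric mark (1, 2, ...) and a pixel position for it."""
    return [(i + 1, mark_position(m)) for i, m in enumerate(masks)]


def render(masks: List[Mask], height: int, width: int) -> str:
    """ASCII stand-in for the marked image: '.' is background, digits are marks."""
    grid = [["." for _ in range(width)] for _ in range(height)]
    for label, (r, c) in assign_marks(masks):
        # Labels above 9 would need more than one cell; this sketch ignores that.
        grid[r][c] = str(label)
    return "\n".join("".join(row) for row in grid)
```

In a real pipeline the masks would come from a segmentation model and the marks would be drawn onto the image before it is sent to the multimodal model, so that a question can refer to a region by its mark.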

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao • 2023

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2019
Visual Question Answering	GQA	Accuracy 54.5	1425
Visual Question Answering	GQA	Accuracy 62	524
Visual Question Answering	TextVQA	TextVQA Accuracy 61.5	210
Document Visual Question Answering	DocVQA	Accuracy 57.4	203
GUI Agent Task	AndroidWorld	Success Rate 25.4	188
Visual Mathematical Reasoning	MathVista (testmini)	Accuracy 67.2	88
Multi-modal Question Answering	MMMU	Accuracy 45.1	83
Computer Use	OSWorld	OS Success Rate 20.83	45
GUI Navigation and Action	OS World (test)	Success Rate (Avg) 4.59	41

Showing 10 of 29 rows
