ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

About

While existing large vision-language multimodal models focus on whole-image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks such as Visual7W, PointQA, and the Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark that assesses how well models understand visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available.
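The overlay idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`blend`, `overlay_box`), the nested-list image representation, the `alpha` parameter, and the example question are all assumptions made for illustration. The point is only that the marker is written into the pixel values themselves, so the model receives an ordinary RGB image plus a text question that refers to the marker.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels; a stand-in for a real image array


def blend(base: Pixel, marker: Pixel, alpha: float) -> Pixel:
    """Alpha-blend a marker color over a base pixel (alpha=1.0 fully replaces it)."""
    return tuple(round(alpha * m + (1.0 - alpha) * b) for b, m in zip(base, marker))


def overlay_box(
    image: Image,
    box: Tuple[int, int, int, int],
    color: Pixel = (255, 0, 0),
    thickness: int = 1,
    alpha: float = 1.0,
) -> Image:
    """Return a copy of `image` with a rectangle outline drawn into its pixels.

    `box` is (x0, y0, x1, y1) in inclusive pixel coordinates (hypothetical
    convention chosen for this sketch).
    """
    x0, y0, x1, y1 = box
    out = [list(row) for row in image]  # copy so the input stays untouched
    height, width = len(out), len(out[0])
    for y in range(max(y0, 0), min(y1, height - 1) + 1):
        for x in range(max(x0, 0), min(x1, width - 1) + 1):
            on_border = (
                x - x0 < thickness
                or x1 - x < thickness
                or y - y0 < thickness
                or y1 - y < thickness
            )
            if on_border:
                out[y][x] = blend(out[y][x], color, alpha)
    return out


# A tiny all-black 8x6 image with a red box drawn around a region of interest.
image = [[(0, 0, 0)] * 8 for _ in range(6)]
marked = overlay_box(image, box=(1, 1, 5, 4))

# The region is then referenced in plain language rather than by coordinates.
prompt = "What is the object within the red bounding box?"
```

In this sketch the marked image and the prompt would be passed to a multimodal model together; no separate region encoding accompanies them.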

Mu Cai, Haotian Liu, Dennis Park, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Yong Jae Lee • 2023

Related benchmarks

Task	Dataset	Result
Multiple-choice Visual Question Answering	EarthVLSet	OA 79.78	21
Error detection	KITTI DK:test^noise noise-induced (test)	Recall 61	15
Open-Ended Visual Question Answering	EarthVLSet 1.0 (test)	BLEU-1 0.5601	12
Tactile attribute description	TVL Benchmarks	SSVTP 4.1	6
Placement assessment	Touch2Add 1.0 (test)	mIoU 0.12	3

