Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

About

In human conversations, individuals can indicate relevant regions within a scene while addressing others. In turn, the other person can then respond by referring to specific regions if necessary. This natural referential ability in dialogue remains absent in current Multimodal Large Language Models (MLLMs). To fill this gap, this paper proposes an MLLM called Shikra, which can handle spatial coordinate inputs and outputs in natural language. Its architecture consists of a vision encoder, an alignment layer, and a LLM. It is designed to be straightforward and simple, without the need for extra vocabularies, position encoder, pre-/post-detection modules, or external plug-in models. All inputs and outputs are in natural language form. Referential dialogue is a superset of various vision-language (VL) tasks. Shikra can naturally handle location-related tasks like REC and PointQA, as well as conventional VL tasks such as Image Captioning and VQA. Experimental results showcase Shikra's promising performance. Furthermore, it enables numerous exciting applications, like providing mentioned objects' coordinates in chains of thoughts and comparing user-pointed regions similarities. Our code, model and dataset are accessed at https://github.com/shikras/shikra.

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, Rui Zhao• 2023

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy84.7	2019
Visual Question Answering	VQA v2	Accuracy77.4	1429
Visual Question Answering	GQA	Accuracy58.8	1425
Multimodal Understanding	MMBench	Accuracy58.8	847
Multimodal Evaluation	MME	--	727
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy83.3	712
Image Captioning	MS COCO Karpathy (test)	CIDEr1.175	706
Object Detection	COCO (val)	--	637
Semantic segmentation	Cityscapes (val)	mIoU17.77	527
Multimodal Understanding	SEED-Bench	Accuracy58.8	516

Showing 10 of 188 rows

...

Other info

Code

Follow for update

@wizwand_team Discord