MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

About

Multimodal Large Language Models (MLLMs) have demonstrated a profound capability for multimodal understanding. However, the simultaneous generation of images with coherent text is still underdeveloped. To address this, we introduce a novel interleaved vision-and-language generation method centered on the concept of "generative vokens". These vokens serve as pivotal elements contributing to coherent image-text outputs. Our method is marked by a unique two-stage training strategy for description-free multimodal generation, which does not require extensive descriptions of images. We integrate classifier-free guidance to enhance the alignment of generated images and texts, ensuring more seamless and contextually relevant multimodal interactions. Our model, MiniGPT-5, exhibits substantial improvement over the baseline models on multimodal generation datasets, including MMDialog and VIST. Human evaluation shows that MiniGPT-5 is better than the baseline model in more than 56% of cases for multimodal generation, highlighting its efficacy across diverse benchmarks.

Kaizhi Zheng, Xuehai He, Xin Eric Wang • 2023

Related benchmarks

Task	Dataset	Result
Text-based Visual Question Answering	TextVQA	Accuracy 2.8	962
Chart Question Answering	ChartQA	Accuracy 1.4	371
Table Question Answering	WTQ	Accuracy 0.9	101
Document Visual Question Answering	InfoVQA	Accuracy 0.021	85
Document-oriented Visual Question Answering	DocVQA	Accuracy 1.6	84
Interleaved Image-Text Generation	OpenING	Completeness 3.91	27
Text-to-Image Generation	MARIO-Eval	CLIPScore 0.25	25
Text-Centric Vision-Language Understanding	OCR Bench	Accuracy 68	20
Scene Text-Centric Visual Question Answering	STVQA	Accuracy 0.024	20
Interleaved Image-Text Generation	WeaverBench	FDT 30.2	15

Showing 10 of 21 rows
