
MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

About

Multimodal Large Language Models (MLLMs) have demonstrated a profound capability in multimodal understanding. However, the simultaneous generation of images with coherent text is still underdeveloped. To address this, we introduce a novel interleaved vision-and-language generation method centered on the concept of "generative vokens". These vokens serve as pivotal elements contributing to coherent image-text outputs. Our method uses a two-stage training strategy for description-free multimodal generation, which does not require extensive descriptions of images. We integrate classifier-free guidance to improve the alignment of generated images and texts, yielding more seamless and contextually relevant multimodal interactions. Our model, MiniGPT-5, shows substantial improvement over the baseline models on multimodal generation datasets, including MMDialog and VIST. In human evaluation, MiniGPT-5 is judged better than the baseline model in more than 56% of cases for multimodal generation, highlighting its efficacy across diverse benchmarks.

Kaizhi Zheng, Xuehai He, Xin Eric Wang • 2023
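
The abstract mentions classifier-free guidance without spelling out the mechanics. As a minimal sketch of the generic technique (not the authors' implementation), the guided prediction is usually formed by extrapolating from an unconditional noise prediction toward a conditional one. The function name `classifier_free_guidance`, its parameters, and the use of plain Python lists in place of tensors are all assumptions made for illustration; how MiniGPT-5 conditions on vokens is not described on this page.

```python
from typing import List, Sequence


def classifier_free_guidance(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Blend unconditional and conditional noise predictions.

    Generic formulation: eps = eps_uncond + s * (eps_cond - eps_uncond).
    Hypothetical helper for illustration; real models operate on tensors.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(eps_uncond, eps_cond)
    ]


if __name__ == "__main__":
    uncond = [0.0, 1.0]
    cond = [1.0, 3.0]
    # A scale of 1.0 reproduces the conditional prediction;
    # larger scales push further in the conditional direction.
    print(classifier_free_guidance(uncond, cond, 1.0))
    print(classifier_free_guidance(uncond, cond, 2.0))
```

With a scale of 0 the result is the unconditional prediction, and with a scale of 1 it is the conditional one; values above 1 trade sample diversity for closer adherence to the conditioning signal.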

Related benchmarks

Task                                          Dataset                      Result          Rank
Text-based Visual Question Answering          TextVQA                      Accuracy 2.8    496
Chart Question Answering                      ChartQA                      Accuracy 1.4    229
Table Question Answering                      WTQ                          Accuracy 0.9    101
Document-oriented Visual Question Answering   DocVQA                       Accuracy 1.6    72
Document Visual Question Answering            InfoVQA                      --              32
Text-to-Image Generation                      MARIO-Eval                   CLIPScore 0.25  25
Text-Centric Vision-Language Understanding    OCR Bench                    Accuracy 68     20
Scene Text-Centric Visual Question Answering  OCRVQA                       Accuracy 2.3    14
Scene Text-Centric Visual Question Answering  STVQA                        Accuracy 0.024  14
Visual Text Editing                           AnyText benchmark-EN (test)  NED 0.02        8
Showing 10 of 16 rows
