MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
About
Multimodal Large Language Models (MLLMs) demonstrate a profound capability in multimodal understanding. However, the simultaneous generation of images with coherent text is still underdeveloped. To address this, we introduce a novel interleaved vision-and-language generation method centered on the concept of "generative vokens". These vokens serve as pivotal elements contributing to coherent image-text outputs. Our method is marked by a unique two-stage training strategy for description-free multimodal generation, which does not require extensive descriptions of images. We integrate classifier-free guidance to enhance the alignment of generated images and texts, ensuring more seamless and contextually relevant multimodal interactions. Our model, MiniGPT-5, exhibits substantial improvement over the baseline models on multimodal generation datasets, including MMDialog and VIST. Human evaluation shows that MiniGPT-5 outperforms the baseline model in more than 56% of cases for multimodal generation, highlighting its efficacy across diverse benchmarks.
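The classifier-free guidance mentioned above can be illustrated with a minimal sketch of the standard guidance rule, in which a conditional and an unconditional noise prediction are blended by a guidance scale. The function name `cfg_combine`, the list-of-floats representation, and the example values are assumptions made for illustration; this is the generic formulation, not the MiniGPT-5 implementation.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Blend two noise predictions with the standard classifier-free guidance rule.

    eps_uncond: prediction made without the condition (hypothetical input)
    eps_cond:   prediction made with the condition (hypothetical input)
    guidance_scale: 0 returns the unconditional prediction, 1 returns the
                    conditional one, values above 1 push further toward the condition
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    # eps = eps_uncond + s * (eps_cond - eps_uncond), applied element-wise
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    uncond = [0.0, 1.0]
    cond = [1.0, 3.0]
    for scale in (0.0, 1.0, 2.0):
        print(scale, cfg_combine(uncond, cond, scale))
```

A scale above 1 extrapolates past the conditional prediction, which is how guidance strengthens the influence of the conditioning signal at sampling time.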
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-based Visual Question Answering | TextVQA | Accuracy: 2.8 | 496 |
| Chart Question Answering | ChartQA | Accuracy: 1.4 | 229 |
| Table Question Answering | WTQ | Accuracy: 0.9 | 101 |
| Document-oriented Visual Question Answering | DocVQA | Accuracy: 1.6 | 72 |
| Document Visual Question Answering | InfoVQA | -- | 32 |
| Text-to-Image Generation | MARIO-Eval | CLIPScore: 0.25 | 25 |
| Text-Centric Vision-Language Understanding | OCR Bench | Accuracy: 68 | 20 |
| Scene Text-Centric Visual Question Answering | OCRVQA | Accuracy: 2.3 | 14 |
| Scene Text-Centric Visual Question Answering | STVQA | Accuracy: 0.024 | 14 |
| Visual Text Editing | AnyText benchmark-EN (test) | NED: 0.02 | 8 |