
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

About

We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. By initializing from multimodal Generative PreTraining (mGPT), we demonstrate that a decoder-only Autoregressive (AR) model can achieve image generation performance comparable to modern diffusion models with high efficiency through Flexible Progressive Supervised Fine-tuning (FP-SFT). Equipped with our proposed Unambiguous image Representation (UniRep), Lumina-mGPT can flexibly generate high-quality images of varying aspect ratios. Building on these strong image generation capabilities, we further explore Omnipotent Supervised Fine-tuning (Omni-SFT), an initial attempt to elevate Lumina-mGPT into a unified multimodal generalist. The resulting model demonstrates versatile multimodal capabilities, including visual generation tasks such as text-to-image/multiview generation and controllable generation, visual recognition tasks such as segmentation and depth estimation, and vision-language tasks such as multi-turn visual question answering, showing the promising potential of this technical direction. Code and checkpoints are available at https://github.com/Alpha-VLLM/Lumina-mGPT.
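The decoder-only AR approach described above treats an image as a 1-D sequence of discrete codebook tokens and samples them one at a time, exactly like next-word prediction in a language model. The following is a minimal illustrative sketch of that sampling loop, not the official Lumina-mGPT API: `toy_logits`, the codebook size, and the grid dimensions are all hypothetical stand-ins for a real transformer and VQ tokenizer.

```python
# Illustrative sketch of decoder-only autoregressive image generation.
# An image is a raster-scan sequence of discrete tokens; each step the
# model scores the whole vocabulary and one token is sampled.
import math
import random

VOCAB = 16   # toy codebook size (real VQ codebooks hold thousands of entries)
H = W = 4    # toy latent grid; real models use far larger token grids

def toy_logits(prefix):
    """Stand-in for a decoder-only transformer: unnormalized scores."""
    random.seed(len(prefix))  # deterministic toy behaviour for the demo
    return [random.uniform(-1.0, 1.0) for _ in range(VOCAB)]

def sample(logits, temperature=1.0):
    """Softmax-with-temperature sampling over the vocabulary."""
    probs = [math.exp(l / temperature) for l in logits]
    total = sum(probs)
    r, acc = random.random() * total, 0.0
    for tok, p in enumerate(probs):
        acc += p
        if r <= acc:
            return tok
    return VOCAB - 1

def generate_image_tokens():
    """Sample H*W tokens in raster-scan order, one per forward pass."""
    tokens = []
    for _ in range(H * W):
        tokens.append(sample(toy_logits(tokens)))
    return tokens

grid = generate_image_tokens()
print(len(grid))  # H*W tokens, later decoded back to pixels by a VQ decoder
```

The flexible aspect ratios enabled by UniRep fit naturally into this scheme: changing the target height and width only changes the number of tokens sampled, since the sequence length is not fixed by the architecture.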

Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yi Xin, Xinyue Li, Qi Qin, Yu Qiao, Hongsheng Li, Peng Gao • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | GenEval Score | 56 | 277 |
| Text-to-Image Generation | DPG | Overall Score | 79.98 | 131 |
| Text-to-Image Generation | DPG-Bench | DPG Score | 79.7 | 89 |
| Text-to-Image Generation | PartiPrompt | Latency (s) | 87.25 | 15 |
| Text-to-Image Generation | GenEval | Single Object Accuracy | 100 | 11 |
| Image Generation | Image Generation Dataset | CLIP Score | 0.333 | 7 |
