
EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs

About

We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models that predominantly depend on encoders like CLIP or ImageBind and need ample amounts of training data to bridge modalities, EasyGen leverages BiDiffuser, a bidirectional conditional diffusion model, to foster more efficient modality interactions. EasyGen achieves text generation by training a projection layer linking BiDiffuser and an LLM, and facilitates image generation by training an adapter to align the LLM's text space with BiDiffuser's image space. Comprehensive quantitative and qualitative experiments show that EasyGen excels in data-efficient training, high-quality image generation, and extendibility, effectively addressing the challenges in multimodal generation. The source code is available at https://github.com/zxy556677/EasyGen.
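The core bridging idea described above can be sketched in a few lines: image features produced by the diffusion model are mapped through a learned linear projection into the LLM's token-embedding space, so the LLM can condition on them like soft prompt tokens. This is an illustrative sketch, not the authors' implementation; the dimensions, function names, and initialization below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical widths: BiDiffuser feature dimension and LLM embedding dimension.
DIFF_DIM = 768
LLM_DIM = 4096

rng = np.random.default_rng(0)

# A single linear projection layer (weights would be learned during training).
W = rng.standard_normal((DIFF_DIM, LLM_DIM)) * 0.02
b = np.zeros(LLM_DIM)

def project(image_features: np.ndarray) -> np.ndarray:
    """Map (num_patches, DIFF_DIM) diffusion features to (num_patches, LLM_DIM)
    vectors living in the LLM's embedding space."""
    return image_features @ W + b

# The projected features act as a "soft prompt": they are prepended to the
# text-token embeddings so the LLM conditions on the image while generating
# a caption or an answer.
feats = rng.standard_normal((16, DIFF_DIM))
soft_prompt = project(feats)
print(soft_prompt.shape)  # (16, 4096)
```

The adapter for image generation works in the reverse direction under the same principle: it aligns LLM text representations with BiDiffuser's image space so the diffusion model can decode them into pixels.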

Xiangyu Zhao, Bo Liu, Qijiong Liu, Guangyuan Shi, Xiao-Ming Wu• 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | TextVQA | Accuracy | 61.5 | 1117 |
| Visual Question Answering | GQA | Accuracy | 44.6 | 963 |
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 1.457 | 682 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 80.5 | 664 |
| Visual Question Answering | OK-VQA | Accuracy | 45.2 | 224 |
| Multimodal Understanding | MMBench (test) | -- | -- | 65 |
| Text-to-Image Generation | MS-COCO 256x256 (val) | FID | 7.68 | 53 |
| Text-to-Image Generation | ImagenHub (test) | CLIP-T Score | 0.282 | 8 |
| Image Generation | PhotoChat | FID | 9.72 | 4 |
| Response Generation | PhotoChat | BLEU-1 | 23.6 | 4 |
