
EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs

About

We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models that predominantly depend on encoders like CLIP or ImageBind and need ample amounts of training data to bridge modalities, EasyGen leverages BiDiffuser, a bidirectional conditional diffusion model, to foster more efficient modality interactions. EasyGen achieves text generation by training a projection layer linking BiDiffuser and an LLM, and facilitates image generation by training an adapter to align the LLM's text space with BiDiffuser's image space. Comprehensive quantitative and qualitative experiments show that EasyGen excels in data-efficient training, high-quality image generation, and extendibility, effectively addressing the challenges in multimodal generation. The source code is available at https://github.com/zxy556677/EasyGen.
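The core bridging idea described above can be sketched in a few lines: image features produced by the diffusion model are mapped through a learned linear projection into the LLM's token-embedding space, so the LLM can condition on them like soft prompt tokens. This is an illustrative sketch, not the authors' implementation; the dimensions, function names, and initialization below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical widths: BiDiffuser feature dimension and LLM embedding dimension.
DIFF_DIM = 768
LLM_DIM = 4096

rng = np.random.default_rng(0)

# A single linear projection layer (weights would be learned during training).
W = rng.standard_normal((DIFF_DIM, LLM_DIM)) * 0.02
b = np.zeros(LLM_DIM)

def project(image_features: np.ndarray) -> np.ndarray:
    """Map (num_patches, DIFF_DIM) diffusion features to (num_patches, LLM_DIM)
    vectors living in the LLM's embedding space."""
    return image_features @ W + b

# The projected features act as a "soft prompt": they are prepended to the
# text-token embeddings so the LLM conditions on the image while generating
# a caption or an answer.
feats = rng.standard_normal((16, DIFF_DIM))
soft_prompt = project(feats)
print(soft_prompt.shape)  # (16, 4096)
```

The adapter for image generation works in the reverse direction under the same principle: it aligns LLM text representations with BiDiffuser's image space so the diffusion model can decode them into pixels.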

Xiangyu Zhao, Bo Liu, Qijiong Liu, Guangyuan Shi, Xiao-Ming Wu• 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | TextVQA | Accuracy | 61.5 | 1117 |
| Visual Question Answering | GQA | Accuracy | 44.6 | 963 |
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 1.457 | 682 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 80.5 | 664 |
| Visual Question Answering | OK-VQA | Accuracy | 45.2 | 224 |
| Multimodal Understanding | MMBench (test) | -- | -- | 65 |
| Text-to-Image Generation | MS-COCO 256x256 (val) | FID | 7.68 | 53 |
| Text-to-Image Generation | ImagenHub (test) | CLIP-T Score | 0.282 | 8 |
| Image Generation | PhotoChat | FID | 9.72 | 4 |
| Response Generation | PhotoChat | BLEU-1 | 23.6 | 4 |
