AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

About

We introduce AnyGPT, an any-to-any multimodal language model that uses discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, so new modalities can be integrated into LLMs much as new languages are. We build a multimodal text-centric dataset for multimodal alignment pre-training. Using generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that interweave various modalities, equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT can carry out any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, showing that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are available at https://junzhan2000.github.io/AnyGPT.github.io/
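
The idea of adding a modality purely through data-level preprocessing can be illustrated with a small sketch. This is not the authors' code: the class `UnifiedVocab`, the helper `build_sequence`, the boundary-token scheme, and all vocabulary and codebook sizes below are hypothetical choices made for illustration. The sketch assumes each non-text modality already has a tokenizer that emits integer codes from a fixed codebook; those codes are shifted into a reserved range above the text vocabulary so that one ordinary token sequence can carry every modality.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedVocab:
    """Maps per-modality discrete codes into one shared token-id space.

    Hypothetical illustration: text ids occupy [0, text_vocab_size); each
    added modality gets two boundary tokens plus a contiguous id range.
    """
    text_vocab_size: int
    offsets: dict = field(default_factory=dict)     # name -> (first id, codebook size)
    boundaries: dict = field(default_factory=dict)  # name -> (start id, end id)
    size: int = 0

    def __post_init__(self):
        self.size = self.text_vocab_size

    def add_modality(self, name: str, codebook_size: int) -> None:
        """Reserve ids for a new modality, much like adding tokens for a new language."""
        if name in self.offsets:
            raise ValueError(f"modality {name!r} already registered")
        self.boundaries[name] = (self.size, self.size + 1)
        self.offsets[name] = (self.size + 2, codebook_size)
        self.size += 2 + codebook_size

    def encode(self, name: str, codes: list) -> list:
        """Shift modality-local codes into the shared space and wrap them in boundary tokens."""
        offset, codebook_size = self.offsets[name]
        for code in codes:
            if not 0 <= code < codebook_size:
                raise ValueError(f"code {code} out of range for {name!r}")
        start, end = self.boundaries[name]
        return [start] + [offset + code for code in codes] + [end]


def build_sequence(vocab: UnifiedVocab, segments: list) -> list:
    """Flatten interleaved (modality, ids) segments into a single token sequence."""
    sequence = []
    for modality, ids in segments:
        if modality == "text":
            sequence.extend(ids)  # text ids are already in the shared space
        else:
            sequence.extend(vocab.encode(modality, ids))
    return sequence


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
vocab = UnifiedVocab(text_vocab_size=32000)
vocab.add_modality("image", codebook_size=8192)
vocab.add_modality("speech", codebook_size=1024)

sequence = build_sequence(
    vocab,
    [("text", [5, 6]), ("image", [0, 8191]), ("speech", [3])],
)
print(sequence)
print(vocab.size)
```

Under these assumptions the language model itself is unchanged apart from a larger embedding table of `vocab.size` rows; everything modality-specific lives in the tokenizers and in this id mapping.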

Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yu-Gang Jiang, Xipeng Qiu • 2024

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech clean (test)	WER 8.5	1410
Image Captioning	MS COCO Karpathy (test)	CIDEr 1.075	706
Multimodal Understanding	SEED-Bench	--	571
Video Question Answering	VideoMME	Accuracy 29.8	254
Text-to-Image Generation	MS-COCO (val)	--	215
Video Question Answering	EgoSchema	Accuracy 32.1	194
Text-to-Image Generation	MS-COCO	--	193
Visual Question Answering	POPE	Accuracy 67.7	136
Text-to-Speech	LibriSpeech clean (test)	WER 27.1	97
Video Question Answering	MVBench	Accuracy 33.2	90

Showing 10 of 26 rows
