OneLLM: One Framework to Align All Modalities with Language

About

Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with the LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to the LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curate a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM
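The idea of "mixing multiple image projection modules and dynamic routing" can be illustrated with a toy sketch. This is not the authors' implementation: the class name `ToyUniversalProjection`, the use of plain linear maps as projection experts, the softmax router, and all dimensions are assumptions chosen only to show how a router can softly weight several projection modules for one input token.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; stands in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyUniversalProjection:
    """Hypothetical soft mixture of projection experts (illustration only)."""

    def __init__(self, in_dim, out_dim, num_experts, seed=0):
        rng = random.Random(seed)
        # Assumption: each expert is a single linear map.
        self.experts = [random_matrix(out_dim, in_dim, rng)
                        for _ in range(num_experts)]
        # Assumption: the router is a linear layer followed by softmax.
        self.router = random_matrix(num_experts, in_dim, rng)

    def route(self, token):
        """Return one mixing weight per expert for this token."""
        return softmax(matvec(self.router, token))

    def project(self, token):
        """Weighted sum of all expert outputs, using the routing weights."""
        weights = self.route(token)
        outputs = [matvec(expert, token) for expert in self.experts]
        out_dim = len(outputs[0])
        return [sum(w * out[i] for w, out in zip(weights, outputs))
                for i in range(out_dim)]


upm = ToyUniversalProjection(in_dim=4, out_dim=3, num_experts=3)
token = [0.5, -1.0, 0.25, 2.0]   # a made-up encoder feature
weights = upm.route(token)       # mixing weights that sum to 1
projected = upm.project(token)   # feature mapped into the output space
```

Because the routing weights depend on the input token, tokens from different modalities can lean on different experts while sharing the same set of projection parameters, which is the intuition the abstract describes.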

Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue • 2023

Related benchmarks

Task	Dataset	Result
Visual Question Answering	VizWiz	Accuracy 45.9	1820
Visual Question Answering	TextVQA	Accuracy 34	1453
Visual Question Answering	VQA v2	Accuracy 71.6	1429
Visual Question Answering	GQA	Accuracy 59.5	1425
Multimodal Understanding	MMBench	Accuracy 60	847
Multimodal Evaluation	MME	Score 1.39e+3	727
Multimodal Understanding	MM-Vet	MM-Vet Score 29.1	631
Multimodal Understanding	SEED-Bench	--	516
Visual Question Answering	ScienceQA	Accuracy 63.4	446
Multimodal Capability Evaluation	MM-Vet	Score 29.1	393

Showing 10 of 81 rows
