Meta-Transformer: A Unified Framework for Multimodal Learning

About

Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it remains challenging to design a unified network for processing various modalities (e.g., natural language, 2D images, 3D point clouds, audio, video, time series, tabular data) due to the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, the raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Composed of three main components (a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks), Meta-Transformer is the first framework to perform unified learning across 12 modalities with unpaired data. Experiments on different benchmarks reveal that Meta-Transformer can handle a wide range of tasks, including fundamental perception (text, image, point cloud, audio, video), practical applications (X-Ray, infrared, hyperspectral, and IMU), and data mining (graph, tabular, and time-series). Meta-Transformer indicates a promising future for developing unified multimodal intelligence with transformers. Code will be available at https://github.com/invictus717/MetaTransformer
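The data flow described in the abstract (modality-specific tokenization into a shared token space, a shared encoder with frozen parameters, and task-specific heads) can be illustrated with a toy sketch. This is not the authors' implementation: the class names, the token width `DIM`, the patch sizes, and the class counts are all hypothetical, and the encoder here is a single fixed linear layer with mean pooling standing in for a pretrained transformer. It only aims to show which parts are shared and which are per-modality.

```python
import random

DIM = 8  # width of the shared token space (hypothetical, kept small for illustration)


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (sequence of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class PatchTokenizer:
    """Per-modality tokenizer: cut a flat signal into patches, project each to DIM."""

    def __init__(self, patch_size, rng):
        self.patch_size = patch_size
        self.proj = make_matrix(DIM, patch_size, rng)  # would be trainable in practice

    def __call__(self, signal):
        p = self.patch_size
        n = len(signal) // p  # a trailing partial patch is dropped
        return [matvec(self.proj, signal[i * p:(i + 1) * p]) for i in range(n)]


class FrozenEncoder:
    """Stand-in for the modality-shared encoder (assumption: NOT a transformer).

    Weights are stored as tuples so they cannot be modified in place,
    mimicking frozen parameters.
    """

    def __init__(self, rng):
        self.weights = tuple(tuple(row) for row in make_matrix(DIM, DIM, rng))

    def __call__(self, tokens):
        hidden = [[max(0.0, h) for h in matvec(self.weights, t)] for t in tokens]
        # Mean-pool over the token sequence to get one feature vector.
        return [sum(col) / len(hidden) for col in zip(*hidden)]


class LinearHead:
    """Task-specific head: maps the pooled feature to class logits."""

    def __init__(self, num_classes, rng):
        self.weights = make_matrix(num_classes, DIM, rng)  # would be trainable in practice

    def __call__(self, feature):
        return matvec(self.weights, feature)


rng = random.Random(0)
tokenizers = {"audio": PatchTokenizer(4, rng), "time_series": PatchTokenizer(2, rng)}
encoder = FrozenEncoder(rng)  # one encoder shared by every modality
heads = {"audio": LinearHead(3, rng), "time_series": LinearHead(2, rng)}


def predict(modality, signal):
    """Tokenize with the modality's tokenizer, encode with the shared encoder, apply the head."""
    tokens = tokenizers[modality](signal)
    feature = encoder(tokens)
    return heads[modality](feature)


audio_logits = predict("audio", [0.1 * i for i in range(16)])
series_logits = predict("time_series", [1.0, 0.5, -0.5, -1.0, 0.0, 2.0])
```

In this sketch only the tokenizers and heads would receive gradient updates; the encoder is built once and reused unchanged for every modality, which is the property the abstract attributes to the frozen encoder.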

Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue • 2023

Related benchmarks

Task	Dataset	Metric	Result
Language Understanding	MMLU	Accuracy	37.3	844
Action Recognition	UCF101	--	--	433
Text-to-Video Retrieval	MSR-VTT	Recall@1	31.5	406
Audio Classification	AudioSet 2M	mAP	38.9	98
Natural Language Understanding	GLUE (test dev)	MRPC Accuracy	81.8	90
Hyperspectral Image Classification	Indian Pines	Overall Accuracy (OA)	0.781	69
Image Classification	Places365	Top-1 Accuracy	52.7	67
Graph property prediction	PCQM4M-LSC (val)	MAE	0.8863	48
Video Classification	Kinetics 700	Top-1 Accuracy	33.2	46
Audio Recognition	Speech Commands V2	Accuracy	97	43

Showing 10 of 24 rows
