Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

About

We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs -- images, text, audio, action, bounding boxes, etc., into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus from diverse sources with a multimodal mixture of denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results in more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.

Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi (2023)

Related benchmarks

Task                               Dataset                  Result               Rank
Commonsense Reasoning              HellaSwag                Accuracy 52.7        1460
Visual Question Answering          VQA v2                   Accuracy 79.4        1165
Object Hallucination Evaluation    POPE                     Accuracy 87.7        935
Multi-task Language Understanding  MMLU                     Accuracy 30.4        842
Question Answering                 ARC Challenge            Accuracy 33.5        749
Image Captioning                   MS COCO Karpathy (test)  --                   682
Video Question Answering           MSRVTT-QA                Accuracy 41.5        481
Question Answering                 ARC Easy                 Normalized Acc 55.3  385
Multimodal Understanding           MMBench                  Accuracy 71.5        367
Video Question Answering           MSVD-QA                  Accuracy 52.1        340
Showing 10 of 76 rows