Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

About

We propose Unified-IO, a model that performs a large variety of AI tasks, ranging from classical computer vision tasks such as pose estimation, object detection, depth estimation, and image generation, through vision-and-language tasks such as region captioning and referring expression, to natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a large variety of tasks poses unique challenges because each task has heterogeneous inputs and outputs, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture jointly on over 90 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark, and it produces strong results across 16 diverse benchmarks such as NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task-specific fine-tuning. Code and demos for Unified-IO are available at: https://unified-io.allenai.org.
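
To make the idea of "homogenizing" outputs into discrete tokens concrete, here is a minimal sketch for one output type, bounding boxes. It is not taken from the Unified-IO code: the bin count `NUM_BINS`, the `<loc_i>` token format, and the helper names `box_to_tokens` and `tokens_to_box` are all assumptions chosen for illustration. The sketch quantizes each normalized box coordinate into one of a fixed number of bins, so a box becomes a short token sequence that can share a vocabulary with text.

```python
from typing import List, Tuple

# Assumed number of location bins; the abstract does not state this value.
NUM_BINS = 1000

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def box_to_tokens(box: Box, image_size: Tuple[int, int],
                  num_bins: int = NUM_BINS) -> List[str]:
    """Quantize a pixel-space box into four discrete location tokens."""
    width, height = image_size
    x0, y0, x1, y1 = box
    normalized = (x0 / width, y0 / height, x1 / width, y1 / height)
    tokens = []
    for value in normalized:
        clamped = min(max(value, 0.0), 1.0)
        # The coordinate 1.0 falls into the last bin rather than overflowing.
        index = min(int(clamped * num_bins), num_bins - 1)
        tokens.append(f"<loc_{index}>")  # hypothetical token spelling
    return tokens


def tokens_to_box(tokens: List[str], image_size: Tuple[int, int],
                  num_bins: int = NUM_BINS) -> Box:
    """Map four location tokens back to pixel coordinates (bin centers)."""
    width, height = image_size
    centers = [(int(t[len("<loc_"):-1]) + 0.5) / num_bins for t in tokens]
    return (centers[0] * width, centers[1] * height,
            centers[2] * width, centers[3] * height)


if __name__ == "__main__":
    tokens = box_to_tokens((160, 120, 320, 240), (640, 480))
    print(tokens)
    print(tokens_to_box(tokens, (640, 480)))
```

Decoding returns bin centers, so the round trip is lossy by at most one bin width per coordinate. A finer `num_bins` trades a larger vocabulary for tighter localization.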

Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi • 2022

Related benchmarks

Task	Dataset	Result
Visual Question Answering	VizWiz	Accuracy 57.4	1820
Visual Question Answering	VQA v2	Accuracy 77.9	1429
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy 77.9	712
Image Captioning	MS COCO Karpathy (test)	CIDEr 1.223	706
Image Classification	ImageNet-1K	--	600
Referring Expression Comprehension	RefCOCO (val)	Accuracy 78.6	348
Visual Question Answering	VQA 2.0 (test-dev)	Accuracy 77.9	337
Visual Question Answering	OK-VQA (test)	Accuracy 54	327
Referring Image Segmentation	RefCOCO (val)	--	274
Visual Question Answering	OK-VQA	Accuracy 54	272

Showing 10 of 55 rows