UniT: Multimodal Multitask Learning with a Unified Transformer
About
We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning. Based on the transformer encoder-decoder architecture, our UniT model encodes each input modality with an encoder and makes predictions on each task with a shared decoder over the encoded input representations, followed by task-specific output heads. The entire model is jointly trained end-to-end with losses from each task. Compared to previous efforts on multi-task learning with transformers, we share the same model parameters across all tasks instead of separately fine-tuning task-specific models and handle a much higher variety of tasks across different domains. In our experiments, we learn 7 tasks jointly over 8 datasets, achieving strong performance on each task with significantly fewer parameters. Our code is available in MMF at https://mmf.sh.
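The parameter-sharing scheme described above — modality-specific encoders feeding a single shared decoder, followed by task-specific heads — can be sketched structurally. This is a minimal illustration with hypothetical names and toy arithmetic standing in for the transformer modules; the actual implementation lives in MMF at the URL above.

```python
# Structural sketch of UniT-style multitask routing (hypothetical names;
# toy arithmetic replaces the real transformer encoder/decoder blocks).

class UniTSketch:
    def __init__(self, tasks):
        # One encoder per input modality, shared by all tasks of that modality.
        self.encoders = {
            "image": lambda x: [v * 2.0 for v in x],
            "text": lambda x: [v + 1.0 for v in x],
        }
        # A single decoder shared across every task: this is the key
        # difference from fine-tuning one model per task.
        self.decoder = lambda feats: sum(feats) / len(feats)
        # Small task-specific output heads on top of the shared decoder.
        self.heads = {
            t: (lambda d, scale=s: d * scale)
            for s, t in enumerate(tasks, start=1)
        }

    def forward(self, task, modality, inputs):
        feats = self.encoders[modality](inputs)  # modality-specific encoding
        decoded = self.decoder(feats)            # shared decoder
        return self.heads[task](decoded)         # task-specific head

model = UniTSketch(["detection", "vqa"])
det = model.forward("detection", "image", [1.0, 2.0, 3.0])  # -> 4.0
vqa = model.forward("vqa", "text", [1.0, 2.0, 3.0])         # -> 6.0
```

Because `self.decoder` is one object, gradients from every task's loss would flow into the same decoder parameters during joint end-to-end training, which is how the model stays compact across 7 tasks.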
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 42.3 | 2454 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 67 | 664 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 89.3 | 504 |
| Object Detection | COCO | APb | 40.8 | 44 |
| Visual Entailment | SNLI-VE | Accuracy | 0.731 | 24 |