UniT: Multimodal Multitask Learning with a Unified Transformer
About
We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning. Based on the transformer encoder-decoder architecture, our UniT model encodes each input modality with an encoder and makes predictions on each task with a shared decoder over the encoded input representations, followed by task-specific output heads. The entire model is jointly trained end-to-end with losses from each task. Compared to previous efforts on multi-task learning with transformers, we share the same model parameters across all tasks instead of separately fine-tuning task-specific models and handle a much higher variety of tasks across different domains. In our experiments, we learn 7 tasks jointly over 8 datasets, achieving strong performance on each task with significantly fewer parameters. Our code is available in MMF at https://mmf.sh.
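The parameter-sharing scheme described above — modality-specific encoders feeding a single shared decoder, followed by task-specific heads — can be sketched structurally. This is a minimal illustration with hypothetical names and toy arithmetic standing in for the transformer modules; the actual implementation lives in MMF at the URL above.

```python
# Structural sketch of UniT-style multitask routing (hypothetical names;
# toy arithmetic replaces the real transformer encoder/decoder blocks).

class UniTSketch:
    def __init__(self, tasks):
        # One encoder per input modality, shared by all tasks of that modality.
        self.encoders = {
            "image": lambda x: [v * 2.0 for v in x],
            "text": lambda x: [v + 1.0 for v in x],
        }
        # A single decoder shared across every task: this is the key
        # difference from fine-tuning one model per task.
        self.decoder = lambda feats: sum(feats) / len(feats)
        # Small task-specific output heads on top of the shared decoder.
        self.heads = {
            t: (lambda d, scale=s: d * scale)
            for s, t in enumerate(tasks, start=1)
        }

    def forward(self, task, modality, inputs):
        feats = self.encoders[modality](inputs)  # modality-specific encoding
        decoded = self.decoder(feats)            # shared decoder
        return self.heads[task](decoded)         # task-specific head

model = UniTSketch(["detection", "vqa"])
det = model.forward("detection", "image", [1.0, 2.0, 3.0])  # -> 4.0
vqa = model.forward("vqa", "text", [1.0, 2.0, 3.0])         # -> 6.0
```

Because `self.decoder` is one object, gradients from every task's loss would flow into the same decoder parameters during joint end-to-end training, which is how the model stays compact across 7 tasks.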
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 42.3 | 2454 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 67 | 664 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 89.3 | 504 |
| Object Detection | COCO | APb | 40.8 | 44 |
| Visual Entailment | SNLI-VE | Accuracy | 0.731 | 24 |