
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

About

Biological intelligence systems of animals perceive the world by integrating information in different modalities and processing it simultaneously for various tasks. In contrast, current machine learning research follows a task-specific paradigm, leading to inefficient collaboration between tasks and high marginal costs of developing perception models for new tasks. In this paper, we present a generic perception architecture named Uni-Perceiver, which processes a variety of modalities and tasks with unified modeling and shared parameters. Specifically, Uni-Perceiver encodes different task inputs and targets from arbitrary modalities into a unified representation space with a modality-agnostic Transformer encoder and lightweight modality-specific tokenizers. Different perception tasks are modeled with the same formulation: finding the maximum-likelihood target for each input through the similarity of their representations. The model is pre-trained on several uni-modal and multi-modal tasks, and evaluated on a variety of downstream tasks, including novel tasks that did not appear in the pre-training stage. Results show that our pre-trained model, without any tuning, can achieve reasonable performance even on novel tasks. The performance can be improved to a level close to state-of-the-art methods by conducting prompt tuning on 1% of downstream task data. Full-data fine-tuning further delivers results on par with or better than state-of-the-art results. Code shall be released.
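The unified formulation described above (tokenize each modality, encode everything into one shared representation space, then pick the candidate target most similar to the input) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the random projection standing in for the shared Transformer encoder, the mean-pooling, and the function names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the shared, modality-agnostic encoder: a single projection
# applied to token sequences from ANY modality (hypothetical toy weights;
# the real model uses a Transformer with shared parameters).
W = rng.standard_normal((16, 8))

def encode(tokens: np.ndarray) -> np.ndarray:
    """Mean-pool the token embeddings, project into the unified
    representation space, and L2-normalize."""
    z = tokens.mean(axis=0) @ W
    return z / np.linalg.norm(z)

def predict(input_tokens: np.ndarray, candidate_targets: list) -> int:
    """Every task is cast the same way: return the index of the candidate
    target whose representation is most similar to the input's (the
    maximum-likelihood target under a similarity-based likelihood)."""
    x = encode(input_tokens)
    sims = [float(x @ encode(t)) for t in candidate_targets]
    return int(np.argmax(sims))

# Toy zero-shot classification: an "image" (sequence of 4 token embeddings,
# as a modality-specific tokenizer might produce) matched against 3 "text"
# candidates, e.g. tokenized class names.
image = rng.standard_normal((4, 16))
class_names = [rng.standard_normal((3, 16)) for _ in range(3)]
best = predict(image, class_names)
```

Because classification, retrieval, and VQA are all reduced to this input-target matching, the same encoder and the same `predict` logic serve every task; only the tokenizers and the candidate target set change.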

Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Xiaogang Wang, Hongsheng Li, Xiaohua Wang, Jifeng Dai • 2021

Related benchmarks

Task | Dataset | Metric | Result | Rank
Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy (%) | 83.8 | 1155
Visual Question Answering | VQA v2 (test-dev) | -- | -- | 664
Image Classification | ImageNet-1K | Top-1 Acc | 87 | 524
Visual Question Answering | VQA v2 (test-std) | Accuracy | 74.1 | 466
Text-to-Image Retrieval | Flickr30K | R@1 | 82.1 | 460
Natural Language Understanding | GLUE | SST-2 | 91.2 | 452
Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 74.9 | 439
Natural Language Understanding | GLUE (test) | SST-2 Accuracy | 90.2 | 416
Image-to-Text Retrieval | Flickr30K | R@1 | 94.7 | 379
Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 73.4 | 337

Showing 10 of 36 rows.
