Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
About
To build an artificial neural network that resembles a biological intelligence system, recent works have unified numerous tasks into a generalist model, which processes various tasks with shared parameters and has no task-specific modules. While generalist models achieve promising results on various benchmarks, they suffer performance degradation on some tasks compared with task-specialized models. In this work, we find that interference among different tasks and modalities is the main cause of this phenomenon. To mitigate such interference, we introduce Conditional Mixture-of-Experts (Conditional MoEs) to generalist models. Routing strategies under different levels of conditions are proposed to take both training/inference cost and generalization ability into account. By incorporating the proposed Conditional MoEs, the recently proposed generalist model Uni-Perceiver can effectively mitigate the interference across tasks and modalities, and achieve state-of-the-art results on a series of downstream tasks via prompt tuning on 1% of downstream data. Moreover, the introduction of Conditional MoEs preserves the ability of generalist models to conduct zero-shot inference on new tasks, e.g., video-text retrieval and video captioning. Code and pre-trained generalist models will be released.
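The idea of routing "under different levels of conditions" can be illustrated with a toy layer in which the gate sees a condition vector rather than necessarily the token itself. The sketch below is not the authors' implementation: the class name `ConditionalMoE`, the linear experts, the top-k softmax gate, and the example condition vector are all assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class ConditionalMoE:
    """Hypothetical sparse MoE layer whose gate is driven by a condition vector."""

    def __init__(self, dim, num_experts, cond_dim, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Each expert is a toy dim x dim linear map (an assumption; real
        # experts are typically feed-forward blocks).
        self.experts = [
            [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_experts)
        ]
        # Gate maps the condition vector to one logit per expert.
        self.gate = [
            [rng.gauss(0.0, 0.1) for _ in range(cond_dim)]
            for _ in range(num_experts)
        ]

    def route(self, condition):
        """Return (expert_index, weight) pairs for the top-k experts."""
        logits = matvec(self.gate, condition)
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = top[: self.top_k]
        weights = softmax([logits[i] for i in top])
        return list(zip(top, weights))

    def forward(self, token, condition):
        """Combine the outputs of the selected experts for one token."""
        out = [0.0] * len(token)
        for idx, weight in self.route(condition):
            expert_out = matvec(self.experts[idx], token)
            out = [o + weight * y for o, y in zip(out, expert_out)]
        return out


# A coarse (e.g. per-task) condition: every token of the task shares it,
# so the routing decision is computed once and reused.
moe = ConditionalMoE(dim=4, num_experts=4, cond_dim=3, top_k=2)
task_condition = [1.0, 0.0, 0.0]  # hypothetical one-hot task indicator
output = moe.forward([0.5, -0.2, 0.1, 0.9], task_condition)
```

With a coarse condition such as the task indicator above, all tokens of a task activate the same experts, which keeps routing cheap; passing the token itself as the condition (with `cond_dim` equal to `dim`) would instead give per-token routing at a higher cost.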
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Acc | 87 | 600 |
| Image Classification | Flowers102 | Accuracy | 89.8 | 558 |
| Text-to-Image Retrieval | Flickr30K | R@1 | 83.7 | 531 |
| Natural Language Understanding | GLUE | SST-2 | 93.4 | 531 |
| Image-to-Text Retrieval | Flickr30K | R@1 | 94.1 | 429 |
| Image Classification | ImageNet | Top-1 Accuracy | 77.7 | 366 |
| Image Classification | ImageNet-1k (val) | Top-1 Acc | 87 | 303 |
| Text-to-Video Retrieval | MSVD | R@1 | 52.3 | 264 |
| Video Classification | Kinetics 400 (val) | Top-1 Acc | 84.2 | 204 |
| Video Captioning | MSVD | -- | -- | 157 |