Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
About
To build artificial neural networks that resemble biological intelligence systems, recent works have unified numerous tasks into a single generalist model, which processes various tasks with shared parameters and has no task-specific modules. While generalist models achieve promising results on various benchmarks, they suffer performance degradation on some tasks compared with task-specialized models. In this work, we find that interference among different tasks and modalities is the main cause of this phenomenon. To mitigate such interference, we introduce Conditional Mixture-of-Experts (Conditional MoEs) to generalist models. Routing strategies under different levels of conditions are proposed to take both the training/inference cost and the generalization ability into account. By incorporating the proposed Conditional MoEs, the recently proposed generalist model Uni-Perceiver effectively mitigates the interference across tasks and modalities, and achieves state-of-the-art results on a series of downstream tasks via prompt tuning on 1% of downstream data. Moreover, the introduction of Conditional MoEs preserves the ability of generalist models to conduct zero-shot inference on new tasks, e.g., video-text retrieval and video captioning. Code and pre-trained generalist models will be released.
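To make the idea of routing under different conditions concrete, the following is a minimal pure-Python sketch, not the paper's implementation. It assumes a standard top-k softmax gate over linear experts. The names `top_k_gate`, `moe_layer`, and the `condition` argument are hypothetical and chosen only for illustration. With `condition=None`, each token is routed on its own features. With a shared `condition` vector (for example, a hypothetical task or modality embedding), every token in the input is sent to the same experts.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def top_k_gate(condition, gate_weights, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    gate_weights has one row per expert; the weights of the selected
    experts are renormalized with a softmax so they sum to 1.
    """
    logits = matvec(gate_weights, condition)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    return dict(zip(top, probs))


def moe_layer(tokens, experts, gate_weights, k=2, condition=None):
    """Mix the outputs of the selected experts for each token.

    condition=None   -> route on each token's own features (token-level).
    condition=vector -> route all tokens on that shared vector
                        (assumption: e.g. a task or modality embedding).
    """
    outputs = []
    for x in tokens:
        gates = top_k_gate(x if condition is None else condition, gate_weights, k)
        y = [0.0] * len(x)
        for idx, weight in gates.items():
            expert_out = matvec(experts[idx], x)  # each expert is a linear map
            y = [a + weight * b for a, b in zip(y, expert_out)]
        outputs.append(y)
    return outputs


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    double = [[2.0, 0.0], [0.0, 2.0]]
    negate = [[-1.0, 0.0], [0.0, -1.0]]
    experts = [identity, double, negate]
    gate = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]  # 3 experts, 2-dim condition
    tokens = [[1.0, 0.0], [0.0, 1.0]]

    print(moe_layer(tokens, experts, gate, k=2))                        # token-level
    print(moe_layer(tokens, experts, gate, k=2, condition=[1.0, 0.0]))  # shared condition
```

In this sketch, a shared condition means the gate is evaluated on the same vector for every token, so the expert selection could be computed once per input rather than once per token. That is one way the trade-off between cost and flexibility mentioned above might show up in practice.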
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Acc | 87 | 524 |
| Image Classification | Flowers102 | Accuracy | 89.8 | 478 |
| Text-to-Image Retrieval | Flickr30K | R@1 | 83.7 | 460 |
| Natural Language Understanding | GLUE | SST-2 | 93.4 | 452 |
| Image-to-Text Retrieval | Flickr30K | R@1 | 94.1 | 379 |
| Image Classification | ImageNet | Top-1 Accuracy | 77.7 | 324 |
| Image Classification | ImageNet-1k (val) | Top-1 Acc | 87 | 287 |
| Text-to-Video Retrieval | MSVD | R@1 | 52.3 | 218 |
| Video Classification | Kinetics 400 (val) | Top-1 Acc | 84.2 | 204 |
| Image Retrieval | Flickr30K | R@1 | 75.9 | 144 |