Perceiver IO: A General Architecture for Structured Inputs & Outputs
About
A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible. Current architectures, however, cannot be applied beyond a small set of stereotyped settings, as they bake in domain & task assumptions or scale poorly to large inputs or outputs. In this work, we propose Perceiver IO, a general-purpose architecture that handles data from arbitrary settings while scaling linearly with the size of inputs and outputs. Our model augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, doing away with the need for task-specific architecture engineering. The same architecture achieves strong results on tasks spanning natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II. As highlights, Perceiver IO outperforms a Transformer-based BERT baseline on the GLUE language benchmark despite removing input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation with no explicit mechanisms for multiscale correspondence.
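The data flow described in the abstract (encode inputs into a latent array, process the latents, then read out with output queries) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head attention with no learned projections, layer normalization, or MLP blocks, and the names `attend`, `perceiver_io_sketch`, and `num_latent_layers` are hypothetical. It only aims to show why the cost grows linearly with input and output size and how the number of queries sets the output size.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys_values):
    """Single-head scaled dot-product attention.

    Assumption: no learned query/key/value projections; a real model has them.
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (num_queries, num_kv)
    return softmax(scores) @ keys_values           # (num_queries, d)


def perceiver_io_sketch(inputs, latents, output_queries, num_latent_layers=2):
    # Encode: N latents attend to M inputs, cost O(M * N), linear in M.
    z = attend(latents, inputs)
    # Process: self-attention among latents, cost O(N^2), independent of M.
    for _ in range(num_latent_layers):
        z = z + attend(z, z)
    # Decode: O output queries attend to N latents, cost O(O * N), linear in O.
    return attend(output_queries, z)


rng = np.random.default_rng(0)
d = 16
inputs = rng.normal(size=(1000, d))   # M = 1000 input elements
latents = rng.normal(size=(8, d))     # N = 8 latent vectors
queries = rng.normal(size=(5, d))     # O = 5 output queries
out = perceiver_io_sketch(inputs, latents, queries)
print(out.shape)  # (5, 16)
```

Passing a different number of query rows changes the output size without touching the encoder or the latent stack, which is the flexibility the querying mechanism is meant to provide.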
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1k (val) | -- | -- | 1453 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 89.9 | 504 |
| Optical Flow Estimation | KITTI 2015 (train) | Fl-epe | 4.98 | 431 |
| Natural Language Understanding | GLUE (test) | -- | -- | 416 |
| Optical Flow | Sintel (train) | AEPE (Clean) | 1.81 | 179 |
| Optical Flow | KITTI 2015 (test) | -- | -- | 95 |
| Optical Flow | Sintel Final (train) | EPE | 2.42 | 92 |
| Optical Flow | Sintel Clean (train) | EPE | 1.81 | 85 |
| Robotic Manipulation | RLBench | Avg Success Score | 0.494 | 56 |
| Image Classification | ImageNet1K (val) | Top-1 Accuracy | 82.1 | 29 |