Perceiver: General Perception with Iterative Attention

About

Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, and proprioception. The perception models used in deep learning, on the other hand, are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver, a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio. The Perceiver obtains performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels. It is also competitive in all modalities in AudioSet.

Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira • 2021
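
The asymmetric attention described in the abstract can be sketched as a small latent array that repeatedly cross-attends to a much larger input array. The following NumPy sketch is illustrative only and is not the authors' implementation: the sizes `M`, `C`, `N`, `D`, the random projection matrices, and the helper name `cross_attend` are all assumptions, and it omits layer normalization, MLP blocks, and any processing of the latents between cross-attention steps. It reuses one set of weights across iterations purely for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, inputs, w_q, w_k, w_v):
    # Hypothetical helper: queries come from the small latent array,
    # keys and values come from the large input array.
    q = latents @ w_q                          # (N, D)
    k = inputs @ w_k                           # (M, D)
    v = inputs @ w_v                           # (M, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, M), not (M, M)
    return softmax(scores, axis=-1) @ v        # (N, D)

rng = np.random.default_rng(0)
M, C = 4096, 3    # assumed input size; the abstract mentions ~50,000 pixels
N, D = 64, 32     # assumed latent count and width (the "tight bottleneck")

inputs = rng.normal(size=(M, C))
latents = rng.normal(size=(N, D))
w_q = rng.normal(size=(D, D)) / np.sqrt(D)
w_k = rng.normal(size=(C, D)) / np.sqrt(C)
w_v = rng.normal(size=(C, D)) / np.sqrt(C)

# Iteratively distill the inputs into the latents (residual update).
for _ in range(3):
    latents = latents + cross_attend(latents, inputs, w_q, w_k, w_v)

print(latents.shape)   # (64, 32)
print(N * M, M * M)    # 262144 16777216
```

The attention matrix here has N × M entries rather than the M × M that self-attention over the raw inputs would need, which is why the cost grows linearly with input size when the latent count is fixed.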

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Image Classification	CIFAR-100 (test)	Accuracy	52.64	3518
Image Classification	CIFAR-10 (test)	Accuracy	82.52	3381
Image Classification	ImageNet-1k (val)	--	--	1498
Optical Flow Estimation	KITTI 2015 (train)	Fl-epe	4.98	446
Point Cloud Classification	ModelNet40 (test)	Accuracy	85.7	229
Optical Flow	Sintel (train)	AEPE (Clean)	1.81	200
Survival Prediction	TCGA-BRCA (test)	Concordance Index (CI)	0.566	67
Audio Classification	AudioSet	mAP	38.4	60
In-hospital mortality prediction	MIMIC-III (test)	--	--	59
Classification	AudioSet (test)	mAP	38.4	57

Showing 10 of 32 rows
