Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning
About
We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms. However, unlike conventional non-parametric models, we let the model learn end-to-end from the data how to make use of other datapoints for prediction. Empirically, our models solve cross-datapoint lookup and complex reasoning tasks unsolvable by traditional deep learning models. We show highly competitive results on tabular data, early results on CIFAR-10, and give insight into how the model makes use of the interactions between points.
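The key architectural idea, attending over the datapoint axis of the whole dataset rather than over the features of a single input, can be sketched in a few lines. The following is a minimal single-head illustration in NumPy, not the paper's implementation (the function name and weight shapes are assumptions for illustration): each row of the `(n, n)` attention matrix says how much one datapoint draws on every other datapoint when forming its prediction.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_between_datapoints(X, Wq, Wk, Wv):
    """Single-head self-attention across datapoints (illustrative sketch).

    X:  (n, d) -- the ENTIRE dataset is the input, one datapoint per row.
    Wq, Wk, Wv: (d, h) projection matrices (hypothetical parameters).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # (n, n) scores: entry (i, j) relates datapoint i to datapoint j,
    # i.e. attention mixes information across rows, not across features.
    scores = Q @ K.T / np.sqrt(K.shape[1])
    A = softmax(scores, axis=-1)
    return A @ V, A

# Tiny usage example with random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 8)) for _ in range(3))
out, A = attention_between_datapoints(X, Wq, Wk, Wv)
```

A conventional feed-forward model would map each row of `X` to an output independently; here `out[i]` depends on every row of `X` through `A`, which is what lets the model learn cross-datapoint lookups end-to-end.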
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Tabular Data Classification | UCI machine learning repository, 21 datasets (test) | Median Rank | 11 | 29 |
| Classification | DVS-Gesture (test) | Accuracy | 67.83 | 14 |
| Tabular Classification | UCI machine learning repository, small-sized (test) | Median Rank | 11 | 7 |
| Classification | blastchar, medium-sized (test) | Accuracy | 79.98 | 5 |
| Regression | colleges, medium-sized (test) | MSE (×1000) | 25.67 | 5 |
| Classification | shrutime, medium-sized (test) | Accuracy | 85.62 | 5 |
| Classification | eye, medium-sized (test) | Accuracy | 53.21 | 5 |
| Regression | sulfur, medium-sized (test) | MSE (×1000) | 1.24 | 5 |