Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding

About

Attentional mechanisms are order-invariant. Positional encoding is a crucial component that allows attention-based deep model architectures such as the Transformer to handle sequences or images where the position of information matters. In this paper, we propose a novel positional encoding method based on learnable Fourier features. Instead of hard-coding each position as a token or a vector, we represent each position, which can be multi-dimensional, as a trainable encoding based on a learnable Fourier feature mapping, modulated with a multi-layer perceptron. The representation is particularly advantageous for spatial multi-dimensional positions, e.g., pixel positions in an image, where $L_2$ distances or more complex positional relationships need to be captured. Our experiments on several public benchmark tasks show that our learnable Fourier feature representation for multi-dimensional positional encoding outperforms existing methods, both improving accuracy and converging faster.
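The mechanism described in the abstract, a Fourier feature mapping of a multi-dimensional position followed by a multi-layer perceptron, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the parameter layout (`W_r`, `W_1`, `W_2`), the Gaussian initialisation with scale `1 / gamma`, the `1 / sqrt(D)` normalisation, and the GELU activation are all choices made here for the example. The sketch shows only the forward pass; in an actual model, every entry of `params` would be updated by gradient descent together with the rest of the network.

```python
import math
import random


def init_params(pos_dim, fourier_dim, hidden_dim, out_dim, gamma=1.0, seed=0):
    """Randomly initialise the trainable parameters (hypothetical layout)."""
    rng = random.Random(seed)

    def gauss_matrix(rows, cols, std):
        return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

    return {
        # Learnable projection for the Fourier features: (fourier_dim / 2) x pos_dim.
        "W_r": gauss_matrix(fourier_dim // 2, pos_dim, 1.0 / gamma),
        # Two-layer MLP that modulates the Fourier features.
        "W_1": gauss_matrix(hidden_dim, fourier_dim, 1.0 / math.sqrt(fourier_dim)),
        "b_1": [0.0] * hidden_dim,
        "W_2": gauss_matrix(out_dim, hidden_dim, 1.0 / math.sqrt(hidden_dim)),
        "b_2": [0.0] * out_dim,
    }


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def fourier_features(pos, W_r):
    """Map a position to [cos(W_r pos), sin(W_r pos)] scaled by 1 / sqrt(D)."""
    proj = matvec(W_r, pos)
    scale = 1.0 / math.sqrt(2 * len(W_r))  # D = 2 * number of rows in W_r
    return [scale * math.cos(p) for p in proj] + [scale * math.sin(p) for p in proj]


def gelu(x):
    """Exact GELU activation (an assumed choice of nonlinearity)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def positional_encoding(pos, params):
    """Fourier features of the position, passed through a small MLP."""
    r = fourier_features(pos, params["W_r"])
    h = [gelu(a + b) for a, b in zip(matvec(params["W_1"], r), params["b_1"])]
    return [a + b for a, b in zip(matvec(params["W_2"], h), params["b_2"])]


# Encode the 2-D pixel position (3, 5) as a 4-dimensional vector.
params = init_params(pos_dim=2, fourier_dim=8, hidden_dim=16, out_dim=4)
pe = positional_encoding([3.0, 5.0], params)
```

One property of the Fourier feature step on its own: the dot product of the features of two positions x and y equals (1/D) times the sum of cos(w_j · (x − y)) over the rows w_j of `W_r`, so it depends only on the offset x − y and not on the absolute positions. This is one way to see why such features are well suited to capturing distance-like relationships between multi-dimensional positions.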

Yang Li, Si Si, Gang Li, Cho-Jui Hsieh, Samy Bengio • 2021

Related benchmarks

Task                   Dataset                           Metric    Result  Rank
Image Classification   ImageNet-1K                       Accuracy  83.4    92
Image Classification   Oxford-IIIT                       Accuracy  90.5    32
Widget Captioning      Widget Captioning (test)          CIDEr     100.7   17
Semantic Segmentation  Cityscapes 80-20 (val)            Accuracy  83.2    14
Geometric Probing      Three-Cell Experiment 1.0 (test)  Distance  93.1    11
