
Parabolic Position Encoding: Vision-Centric, Principled, Extrapolatable, General

About

We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens (such as from videos, event camera streams, images, or point clouds), our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D sequences in language to nD structures in vision, but account only partially for vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best encoding. Generality experiments on 8 datasets across 4 modalities show that PaPE is a general vision position encoding, as PaPE matches the best baseline on 5 datasets and exceeds all on 2 datasets. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.

Christoffer Koo Øhrstrøm, Rafael I. Cabral Muchacho, Yifei Dong, Filippos Moumtzidellis, Ronja Güldenring, Florian T. Pokorny, Lazaros Nalpantidis • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Image Classification | ImageNet-1K | Top-1 Acc | 80.6 | 1239 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 81.7 | 708 |
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy | 81.7 | 456 |
| Action Recognition | UCF101 | Accuracy | 49.5 | 433 |
| Object Detection | COCO | mAP | 38.9 | 137 |
| 3D Object Classification | ModelNet40 | Top-1 Accuracy | 93.3 | 89 |
| Semantic segmentation | ScanNet | mIoU | 71 | 59 |
| Object Detection | Gen1 | mAP | 34.8 | 21 |
| Gesture Recognition | DVSGesture | Top-1 Accuracy | 0.934 | 20 |
| Multi-modal Perception | nuScenes | mAP | 68.9 | 6 |