
Where to Attend: A Principled Vision-Centric Position Encoding with Parabolas

About

We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens, such as images, point clouds, videos, or event camera streams, our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D sequences in language to nD structures in vision, but with only a partial account of vision characteristics. We address this gap by designing PaPE around principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. We evaluate PaPE on 8 datasets spanning 4 modalities and find that either PaPE or PaPE-RI achieves top performance on 7 of the 8. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best position encoding. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.
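Two of the stated principles, translation invariance and distance decay, can be illustrated with a toy parabola-shaped additive attention bias over a 2D patch grid. This is a minimal sketch only: the function name, the slope `a`, and the exact quadratic form are illustrative assumptions, not PaPE's actual formulation, which is defined in the paper and the linked repository.

```python
import numpy as np

def parabolic_bias(h, w, a=0.1):
    """Toy additive attention bias b_ij = -a * ||p_i - p_j||^2 on an h x w grid.

    The bias depends only on the relative position p_i - p_j (translation
    invariant) and becomes more negative as tokens move apart (distance decay).
    Illustrative only; not the PaPE formulation from the paper.
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (h*w, 2)
    # Pairwise squared Euclidean distances between all patch positions.
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(axis=-1)    # (h*w, h*w)
    return -a * d2

bias = parabolic_bias(4, 4)
# Symmetric, zero on the diagonal, and identical for any pair of tokens
# with the same relative offset.
```

Such a bias would typically be added to the attention logits before the softmax; the rotation-invariant variant (PaPE-RI) and the context-aware component are not captured by this sketch.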

Christoffer Koo Øhrström, Rafael I. Cabral Muchacho, Yifei Dong, Filippos Moumtzidellis, Ronja Güldenring, Florian T. Pokorny, Lazaros Nalpantidis · 2026

Related benchmarks

Task                      Dataset              Metric           Result   Rank
Image Classification      ImageNet-1K          Top-1 Accuracy   80.6     836
Image Classification      ImageNet-1k (val)    Top-1 Accuracy   81.7     512
Action Recognition        UCF101               Accuracy         49.5     365
Image Classification      ImageNet 1k (test)   Top-1 Accuracy   81.7     359
Object Detection          COCO                 mAP              38.9     107
3D Object Classification  ModelNet40           --               --       62
Semantic Segmentation     ScanNet              mIoU             71       59
Gesture Recognition       DVSGesture           Top-1 Accuracy   0.934    16
Object Detection          Gen1                 mAP              34.8     10
Multi-modal Perception    nuScenes             mAP              68.9     6
