Parabolic Position Encoding: Vision-Centric, Principled, Extrapolatable, General

About

We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens (for example from videos, event camera streams, images, or point clouds), our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D sequences in language to nD structures in vision, but they account only partially for vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving by up to 10.5% in absolute terms over the next-best encoding. Generality experiments on 8 datasets across 4 modalities show that PaPE is a general vision position encoding: it matches the best baseline on 5 datasets and exceeds all baselines on 2 datasets. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.
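To make two of the listed principles concrete, here is a minimal sketch of a generic quadratic (parabola-shaped) attention bias that depends only on relative offsets between token positions. This is not the PaPE formulation, which the abstract does not spell out: the function names `parabolic_bias` and `softmax_rows` and the `curvature` parameter are hypothetical, chosen only for illustration. The sketch demonstrates translation invariance and distance decay; it does not attempt directionality or context awareness.

```python
import math

def parabolic_bias(positions, curvature=0.5):
    """Hypothetical bias: -curvature * squared distance between tokens i and j.

    It depends only on the offset p_i - p_j, so shifting every position by
    the same vector leaves the bias unchanged (translation invariance).
    """
    n = len(positions)
    return [
        [
            -curvature * sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]

def softmax_rows(scores):
    """Row-wise softmax, with the row maximum subtracted for numerical stability."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Positions of a 3x3 patch grid, and the same grid shifted by (+7, -2).
grid = [(y, x) for y in range(3) for x in range(3)]
shifted = [(y + 7, x - 2) for y, x in grid]

bias = parabolic_bias(grid)
shifted_bias = parabolic_bias(shifted)

# With all content logits equal to zero, attention is driven by the bias alone,
# so each token attends most to itself and less to more distant tokens.
weights = softmax_rows(bias)
```

In this sketch, `bias` and `shifted_bias` are identical, and the weights in each row of `weights` shrink as the distance to the query token grows. In a real attention layer such a bias would be added to the query-key logits before the softmax.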

Christoffer Koo Øhrstrøm, Rafael I. Cabral Muchacho, Yifei Dong, Filippos Moumtzidellis, Ronja Güldenring, Florian T. Pokorny, Lazaros Nalpantidis • 2026

Related benchmarks

Task	Dataset	Metric	Result
Image Classification	ImageNet-1K	Top-1 Acc	80.6	1239
Image Classification	ImageNet-1k (val)	Top-1 Accuracy	81.7	708
Image Classification	ImageNet 1k (test)	Top-1 Accuracy	81.7	456
Action Recognition	UCF101	Accuracy	49.5	433
Object Detection	COCO	mAP	38.9	137
3D Object Classification	ModelNet40	Top-1 Accuracy	93.3	89
Semantic Segmentation	ScanNet	mIoU	71	59
Object Detection	Gen1	mAP	34.8	21
Gesture Recognition	DVSGesture	Top-1 Accuracy	0.934	20
Multi-modal Perception	nuScenes	mAP	68.9	6
