Parabolic Position Encoding: Vision-Centric, Principled, Extrapolatable, General
About
We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens (for example from videos, event camera streams, images, or point clouds), our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D sequences in language to nD structures in vision, but account only partially for vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best encoding. Generality experiments on 8 datasets across 4 modalities show that PaPE is a general vision position encoding: it matches the best baseline on 5 datasets and exceeds all baselines on 2 datasets. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.
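To make two of the listed principles concrete, here is a minimal sketch of translation invariance and distance decay. It is not the PaPE formulation, which the abstract does not spell out. It assumes a hypothetical attention bias that is a downward parabola in the offset between two token positions. The names `parabolic_bias`, `attention_weights`, and the `curvature` parameter are illustrative assumptions only.

```python
import math

def parabolic_bias(p, q, curvature=0.5):
    """Hypothetical bias (an assumption, not PaPE itself):
    a downward parabola in the offset between positions p and q."""
    return -curvature * sum((a - b) ** 2 for a, b in zip(p, q))

def attention_weights(query_pos, key_positions, curvature=0.5):
    """Softmax over the positional bias alone.
    Content scores are assumed equal so only position matters."""
    logits = [parabolic_bias(query_pos, k, curvature) for k in key_positions]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Four key tokens at increasing distance from a query at the origin (2D grid).
keys = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (2.0, 2.0)]
query = (0.0, 0.0)
weights = attention_weights(query, keys)

# Shift every position by the same offset; only relative offsets enter the bias.
shift = (10.0, -4.0)
shifted_keys = [(y + shift[0], x + shift[1]) for y, x in keys]
shifted_query = (query[0] + shift[0], query[1] + shift[1])
shifted_weights = attention_weights(shifted_query, shifted_keys)
```

In this sketch `weights` and `shifted_weights` agree because the bias depends only on position differences, and the weights shrink as keys move farther from the query. This isotropic bias is also unchanged by rotation, so it cannot express directionality. Context awareness is not modeled here either.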
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Acc | 80.6 | 1239 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 81.7 | 708 |
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy | 81.7 | 456 |
| Action Recognition | UCF101 | Accuracy | 49.5 | 433 |
| Object Detection | COCO | mAP | 38.9 | 137 |
| 3D Object Classification | ModelNet40 | Top-1 Accuracy | 93.3 | 89 |
| Semantic segmentation | ScanNet | mIoU | 71 | 59 |
| Object Detection | Gen1 | mAP | 34.8 | 21 |
| Gesture Recognition | DVSGesture | Top-1 Accuracy | 0.934 | 20 |
| Multi-modal Perception | nuScenes | mAP | 68.9 | 6 |