RayRoPE: Projective Ray Positional Encoding for Multi-view Attention
About
We study positional encodings for multi-view transformers that process tokens from a set of posed input images, and seek a mechanism that encodes patches uniquely, allows $SE(3)$-invariant attention with multi-frequency similarity, and can adapt to the geometry of the underlying 3D scene. We find that prior (absolute or relative) encoding schemes for multi-view attention do not meet these desiderata, and present RayRoPE to address this gap. RayRoPE represents patch positions based on associated rays and computes query-frame projective coordinates to ensure $SE(3)$ invariance. To adapt to scene geometry, RayRoPE predicts (without direct supervision) a per-token depth to obtain its position along the corresponding ray, while also modeling uncertainty and analytically computing the expected positional encoding. We validate our method on the tasks of novel-view synthesis and stereo depth estimation. While remaining efficient, RayRoPE consistently improves over alternative positional encoding schemes (e.g., a 24% relative LPIPS improvement on RE10K and 15% on CO3D).
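The core idea above can be illustrated with a toy sketch: a key token's position is a point along its camera ray (at a predicted depth), that point is mapped into the query camera's frame by a rigid transform, and multi-frequency features are computed from projective coordinates there. This is an illustrative sketch only, not the paper's implementation; the specific coordinate choice `(x/z, y/z, log z)`, the frequency base, and all numeric values are assumptions.

```python
import numpy as np

def to_query_frame(point_w, T_w2q):
    # Apply a world-to-query SE(3) transform (4x4 homogeneous matrix) to a 3D point.
    return (T_w2q @ np.append(point_w, 1.0))[:3]

def multifreq_features(coords, num_freqs=4, base=100.0):
    # Multi-frequency sinusoidal features over the coordinates (RoPE-style phases).
    freqs = base ** (-np.arange(num_freqs) / num_freqs)
    angles = np.outer(coords, freqs).ravel()
    return np.concatenate([np.cos(angles), np.sin(angles)])

# A key token's ray in world coordinates (hypothetical values).
origin = np.array([0.0, 0.0, 0.0])
direction = np.array([0.0, 0.0, 1.0])
depth = 2.5  # stands in for the predicted per-token depth along the ray
point_w = origin + depth * direction

# Query camera pose (world-to-query rigid transform); identity in this toy example.
T_w2q = np.eye(4)
p_q = to_query_frame(point_w, T_w2q)

# Assumed projective coordinates in the query frame: (x/z, y/z, log z).
proj = np.array([p_q[0] / p_q[2], p_q[1] / p_q[2], np.log(p_q[2])])
enc = multifreq_features(proj)

# SE(3)-invariance check: moving the world frame by a rigid transform T while
# composing the query pose with T^{-1} leaves query-frame coordinates unchanged.
theta = 0.3
T = np.eye(4)
T[:3, :3] = [[np.cos(theta), -np.sin(theta), 0.0],
             [np.sin(theta),  np.cos(theta), 0.0],
             [0.0,            0.0,           1.0]]
T[:3, 3] = [1.0, -2.0, 0.5]
point_w2 = (T @ np.append(point_w, 1.0))[:3]
p_q2 = to_query_frame(point_w2, T_w2q @ np.linalg.inv(T))
assert np.allclose(p_q, p_q2)
```

Because the encoding is computed from query-frame coordinates, re-expressing the whole scene in a different world frame leaves it unchanged, which is the invariance property the abstract refers to.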
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Novel View Synthesis | RE10K (test) | PSNR 24.42 | 79 |
| Novel View Synthesis | CO3D (test) | PSNR 18.4 | 30 |
| Depth Estimation | SUN3D | Abs Rel 0.109 | 13 |
| Novel View Synthesis | CO3D unseen categories 29 | PSNR 19.31 | 5 |
| Novel View Synthesis | Objaverse 80K (test) | PSNR 22.42 | 5 |
| Stereo Depth Estimation | Scenes11 | Abs Rel 0.047 | 3 |
| Stereo Depth Estimation | RGBD | Abs Rel 0.106 | 3 |