
RayRoPE: Projective Ray Positional Encoding for Multi-view Attention

About

We study positional encodings for multi-view transformers that process tokens from a set of posed input images, and seek a mechanism that encodes patches uniquely, allows $SE(3)$-invariant attention with multi-frequency similarity, and can adapt to the geometry of the underlying 3D scene. We find that prior (absolute or relative) encoding schemes for multi-view attention do not meet these desiderata, and present RayRoPE to address this gap. RayRoPE represents patch positions based on associated rays and computes query-frame projective coordinates to ensure $SE(3)$ invariance. To adapt to scene geometry, RayRoPE predicts (without direct supervision) a per-token depth to obtain its position along the corresponding ray, while also modeling uncertainty and analytically computing the expected positional encoding. We validate our method on the tasks of novel-view synthesis and stereo depth estimation. While remaining efficient, RayRoPE consistently improves over alternate position encoding schemes (e.g., 24% relative improvement on LPIPS in RE10K and 15% in CO3D).
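The abstract names three ingredients: token positions taken along camera rays, a multi-frequency (RoPE-style) encoding of those positions, and an analytic expectation of the encoding under depth uncertainty. A minimal NumPy sketch of those pieces, assuming a pinhole intrinsics matrix `K` and a 4x4 camera-to-world pose; the function names (`ray_point`, `rope_angles`, `expected_cos_sin`) are ours, and the paper's query-frame projective coordinates and attention wiring are omitted:

```python
import numpy as np

def ray_point(K, cam_to_world, uv, depth):
    """Back-project pixel `uv` to a 3D world point at `depth` along its ray.

    Hypothetical helper, not the paper's implementation: K is a 3x3
    pinhole intrinsics matrix, cam_to_world a 4x4 rigid pose.
    """
    u, v = uv
    x_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return R @ x_cam + t

def rope_angles(pos, num_freqs=4, base=100.0):
    """Multi-frequency sin/cos features of a 3D position (RoPE-style angles)."""
    freqs = base ** (-np.arange(num_freqs) / num_freqs)  # geometric frequency ladder
    angles = np.outer(pos, freqs).ravel()                # each coordinate at each frequency
    return np.concatenate([np.cos(angles), np.sin(angles)])

def expected_cos_sin(w, mu, sigma):
    """Closed-form E[cos(w d)], E[sin(w d)] for Gaussian depth d ~ N(mu, sigma^2)."""
    damp = np.exp(-0.5 * (w * sigma) ** 2)  # uncertainty shrinks the features
    return damp * np.cos(w * mu), damp * np.sin(w * mu)
```

The last function illustrates the kind of analytic expectation the abstract refers to: for a Gaussian depth belief, E[cos(w d)] = exp(-w^2 sigma^2 / 2) cos(w mu), so the encoding smoothly fades toward zero at frequencies where the depth is uncertain.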

Yu Wu, Minsik Jeon, Jen-Hao Rick Chang, Oncel Tuzel, Shubham Tulsiani • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Novel View Synthesis | RE10K (test) | PSNR 24.42 | 79 |
| Novel View Synthesis | CO3D (test) | PSNR 18.4 | 30 |
| Depth Estimation | SUN3D | Abs Rel 0.109 | 13 |
| Novel View Synthesis | CO3D unseen categories 29 | PSNR 19.31 | 5 |
| Novel View Synthesis | Objaverse 80K (test) | PSNR 22.42 | 5 |
| Stereo Depth Estimation | Scenes11 | Abs Rel 0.047 | 3 |
| Stereo Depth Estimation | RGBD | Abs Rel 0.106 | 3 |
