
RayRoPE: Projective Ray Positional Encoding for Multi-view Attention

About

We study positional encodings for multi-view transformers that process tokens from a set of posed input images, and seek a mechanism that encodes patches uniquely, allows SE(3)-invariant attention with multi-frequency similarity, and adapts to the geometry of the underlying scene. We find that prior (absolute or relative) encoding schemes for multi-view attention do not meet the above desiderata, and present RayRoPE to address this gap. RayRoPE represents patch positions via their associated rays, but leverages a predicted point along each ray instead of the ray direction for a geometry-aware encoding. To achieve SE(3) invariance, RayRoPE computes multi-frequency similarity in query-frame projective coordinates. Lastly, as the 'predicted' 3D point along a ray may not be precise, RayRoPE provides a mechanism to analytically compute the expected position encoding under uncertainty. We validate RayRoPE on the tasks of novel-view synthesis and stereo depth estimation and show that it consistently improves over alternate position encoding schemes (e.g. 15% relative improvement on LPIPS in CO3D). We also show that RayRoPE can seamlessly incorporate RGB-D input, resulting in even larger gains over alternatives that cannot positionally encode this information.
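The ingredients named above — rotary (RoPE-style) multi-frequency encoding of per-token 3D coordinates expressed in the query's frame, and a closed-form expectation of that encoding under coordinate uncertainty — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names (`to_query_frame`, `rope_encode`, `expected_rope_encode`), the geometric frequency schedule, and the assumption of Gaussian uncertainty along the coordinate are all illustrative choices.

```python
import numpy as np

def to_query_frame(points_world, T_world_to_query):
    """Map 3D points (N, 3) into the query camera's frame via a 4x4 SE(3) matrix."""
    homo = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    return (homo @ T_world_to_query.T)[:, :3]

def rope_encode(features, coords, base=100.0):
    """RoPE-style encoding: rotate consecutive feature pairs by multi-frequency
    angles derived from a per-token scalar coordinate.
    features: (N, D) with D even, coords: (N,)."""
    n, d = features.shape
    assert d % 2 == 0
    freqs = base ** (-np.arange(d // 2) / (d // 2))  # geometric frequency ladder
    angles = coords[:, None] * freqs[None, :]        # (N, D/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = features[:, 0::2], features[:, 1::2]
    out = np.empty_like(features)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def expected_rope_encode(features, mu, sigma, base=100.0):
    """Analytic expectation of rope_encode when the coordinate is uncertain,
    x ~ N(mu, sigma^2): each frequency f's rotation is rotated to the mean and
    damped by exp(-f^2 sigma^2 / 2) (the Gaussian characteristic function),
    so high frequencies are suppressed for uncertain points."""
    n, d = features.shape
    freqs = base ** (-np.arange(d // 2) / (d // 2))
    angles = mu[:, None] * freqs[None, :]
    damp = np.exp(-0.5 * (sigma[:, None] * freqs[None, :]) ** 2)
    cos, sin = damp * np.cos(angles), damp * np.sin(angles)
    x1, x2 = features[:, 0::2], features[:, 1::2]
    out = np.empty_like(features)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because query and key coordinates are expressed in the same (query) frame before encoding, the rotary dot product depends only on their coordinate difference, so it is unchanged when the scene is rigidly re-posed, which is the invariance the abstract refers to.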

Yu Wu, Minsik Jeon, Jen-Hao Rick Chang, Oncel Tuzel, Shubham Tulsiani • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Novel View Synthesis | Re10K (test) | PSNR | 24.42 | 66 |
| Novel View Synthesis | CO3D (test) | PSNR | 18.4 | 30 |
| Depth Estimation | SUN3D | Abs Rel | 0.109 | 13 |
| Novel View Synthesis | CO3D unseen categories 29 | PSNR | 19.31 | 5 |
| Novel View Synthesis | Objaverse 80K (test) | PSNR | 22.42 | 5 |
| Stereo Depth Estimation | Scenes11 | Abs Rel | 0.047 | 3 |
| Stereo Depth Estimation | RGBD | Abs Rel | 0.106 | 3 |
