RayRoPE: Projective Ray Positional Encoding for Multi-view Attention
About
We study positional encodings for multi-view transformers that process tokens from a set of posed input images, and seek a mechanism that encodes patches uniquely, allows $SE(3)$-invariant attention with multi-frequency similarity, and can adapt to the geometry of the underlying 3D scene. We find that prior (absolute or relative) encoding schemes for multi-view attention do not meet these desiderata, and present RayRoPE to address this gap. RayRoPE represents patch positions based on associated rays and computes query-frame projective coordinates to ensure $SE(3)$ invariance. To adapt to scene geometry, RayRoPE predicts (without direct supervision) a per-token depth to obtain its position along the corresponding ray, while also modeling uncertainty and analytically computing the expected positional encoding. We validate our method on the tasks of novel-view synthesis and stereo depth estimation. While remaining efficient, RayRoPE consistently improves over alternative positional encoding schemes (e.g., a 24% relative LPIPS improvement on RE10K and 15% on CO3D).
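The core idea above can be illustrated with a toy sketch: a key token's position is a point along its camera ray (at a predicted depth), that point is mapped into the query camera's frame by a rigid transform, and multi-frequency features are computed from projective coordinates there. This is an illustrative sketch only, not the paper's implementation; the specific coordinate choice `(x/z, y/z, log z)`, the frequency base, and all numeric values are assumptions.

```python
import numpy as np

def to_query_frame(point_w, T_w2q):
    # Apply a world-to-query SE(3) transform (4x4 homogeneous matrix) to a 3D point.
    return (T_w2q @ np.append(point_w, 1.0))[:3]

def multifreq_features(coords, num_freqs=4, base=100.0):
    # Multi-frequency sinusoidal features over the coordinates (RoPE-style phases).
    freqs = base ** (-np.arange(num_freqs) / num_freqs)
    angles = np.outer(coords, freqs).ravel()
    return np.concatenate([np.cos(angles), np.sin(angles)])

# A key token's ray in world coordinates (hypothetical values).
origin = np.array([0.0, 0.0, 0.0])
direction = np.array([0.0, 0.0, 1.0])
depth = 2.5  # stands in for the predicted per-token depth along the ray
point_w = origin + depth * direction

# Query camera pose (world-to-query rigid transform); identity in this toy example.
T_w2q = np.eye(4)
p_q = to_query_frame(point_w, T_w2q)

# Assumed projective coordinates in the query frame: (x/z, y/z, log z).
proj = np.array([p_q[0] / p_q[2], p_q[1] / p_q[2], np.log(p_q[2])])
enc = multifreq_features(proj)

# SE(3)-invariance check: moving the world frame by a rigid transform T while
# composing the query pose with T^{-1} leaves query-frame coordinates unchanged.
theta = 0.3
T = np.eye(4)
T[:3, :3] = [[np.cos(theta), -np.sin(theta), 0.0],
             [np.sin(theta),  np.cos(theta), 0.0],
             [0.0,            0.0,           1.0]]
T[:3, 3] = [1.0, -2.0, 0.5]
point_w2 = (T @ np.append(point_w, 1.0))[:3]
p_q2 = to_query_frame(point_w2, T_w2q @ np.linalg.inv(T))
assert np.allclose(p_q, p_q2)
```

Because the encoding is computed from query-frame coordinates, re-expressing the whole scene in a different world frame leaves it unchanged, which is the invariance property the abstract refers to.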
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Novel View Synthesis | RE10K (test) | PSNR 24.42 | 79 |
| Novel View Synthesis | CO3D (test) | PSNR 18.4 | 30 |
| Depth Estimation | SUN3D | Abs Rel 0.109 | 13 |
| Novel View Synthesis | CO3D unseen categories 29 | PSNR 19.31 | 5 |
| Novel View Synthesis | Objaverse 80K (test) | PSNR 22.42 | 5 |
| Stereo Depth Estimation | Scenes11 | Abs Rel 0.047 | 3 |
| Stereo Depth Estimation | RGBD | Abs Rel 0.106 | 3 |