
Multi-view Pyramid Transformer: Look Coarser to See Broader

About

We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.
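To see why the dual hierarchy is efficient, it helps to count tokens. The following is a minimal sketch, not the authors' code: it assumes hypothetical settings (64 views, 32×32 patch tokens per view, views per group growing 4× per level while per-view tokens are pooled 4×) and tracks the quadratic self-attention cost at each level of the pyramid.

```python
# Hedged sketch of MVP's dual hierarchy (illustrative only, not the paper's
# implementation): inter-view scope broadens (views per group grows) while
# intra-view resolution coarsens (tokens per view shrinks), so the sequence
# length seen by each attention block stays bounded.

def dual_hierarchy_costs(num_views=64, tokens_per_view=32 * 32, levels=4):
    """Return the total self-attention cost (~num_groups * seq_len^2) per level."""
    costs = []
    group = 1                 # inter-view scope: local views -> groups -> scene
    tokens = tokens_per_view  # intra-view resolution: fine -> coarse
    for _ in range(levels):
        seq_len = group * tokens               # tokens visible to one attention block
        num_groups = num_views // group
        costs.append(num_groups * seq_len ** 2)
        group *= 4    # broaden the model's perspective (more views per group)
        tokens //= 4  # aggregate into compact, information-dense tokens (4x pooling)
    return costs

costs = dual_hierarchy_costs()
print(costs)  # [67108864, 16777216, 4194304, 1048576]
```

Because the 4× broadening and 4× coarsening cancel, `seq_len` stays constant (1024 here) while the number of groups shrinks, so total attention cost drops geometrically with depth instead of growing with the number of input views.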

Gyeongjin Kang, Seungkwon Yang, Seungtae Nam, Younggeun Lee, Jungwoo Kim, Eunbyung Park• 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
--- | --- | --- | --- | ---
Novel View Synthesis | Tanks&Temples (test) | PSNR | 22.36 | 257
Novel View Synthesis | Mip-NeRF 360 | PSNR | 25.12 | 143
Novel View Synthesis | Tanks&Temples | PSNR | 22.36 | 95
Novel View Synthesis | Mip-NeRF360 (test) | PSNR | 25.12 | 62
Novel View Synthesis | DL3DV (test) | PSNR | 29.67 | 61
Novel View Synthesis | DL3DV (evaluation) | PSNR | 29.42 | 22
Novel View Synthesis | RE10K 256x256 (test) | PSNR | 33.4 | 9
Novel View Synthesis | DL3DV high-resolution (960 × 540), 16 views | PSNR | 23.76 | 6
Novel View Synthesis | DL3DV high-resolution (960 × 540), 32 views | PSNR | 25.96 | 6
Novel View Synthesis | DL3DV high-resolution (960 × 540), 64 views | PSNR | 27.73 | 6

(10 of 13 rows shown)

Other info

GitHub
