
Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images

About

Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. We introduce UniSplat, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a dual-masking strategy that strengthens geometry induction in the encoder: by masking both encoder and decoder tokens and targeting the decoder masks at geometry-rich regions, it forces the model to infer structural information from incomplete visual cues, yielding geometry-aware representations even from unposed inputs. Second, we develop a coarse-to-fine Gaussian splatting strategy that reduces appearance-semantics inconsistencies by progressively refining the radiance field. Finally, to enforce geometry-semantic consistency, we introduce a pose-conditioned recalibration mechanism that couples the outputs of multiple heads: predicted 3D point and semantic maps are re-projected into the image plane using estimated camera parameters and aligned with the corresponding RGB and semantic predictions, thereby resolving geometry-semantic mismatches across tasks. Together, these components yield unified 3D representations that are robust to unposed, sparse-view inputs and generalize across diverse tasks, laying a perceptual foundation for spatial intelligence.
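To make the first and third components more concrete, here are two hedged PyTorch sketches. The first illustrates the dual-masking idea: encoder tokens are masked uniformly at random, while decoder masks are biased toward geometry-rich tokens. Everything beyond that one-sentence description is an assumption for illustration; the function name `sample_masks`, the mask ratios, and the use of a precomputed per-token geometry score (e.g., local depth-gradient magnitude) as the targeting signal are not specified by the abstract.

```python
import torch

def sample_masks(tokens, geom_score, enc_ratio=0.5, dec_ratio=0.25):
    """tokens: (B, N, D) ViT-style patch tokens; geom_score: (B, N)
    per-token geometry-richness score (assumed precomputed).

    Encoder masks are uniform-random; decoder masks are biased toward
    geometry-rich tokens, so the model must infer structure from
    incomplete visual cues. Ratios are illustrative, not the paper's.
    """
    B, N, _ = tokens.shape
    n_enc, n_dec = int(N * enc_ratio), int(N * dec_ratio)
    # Encoder: mask a uniform-random subset of tokens (True = masked).
    enc_mask = torch.rand(B, N).argsort(dim=1) < n_enc
    # Decoder: sample tokens without replacement, weighted by geometry score.
    probs = torch.softmax(geom_score, dim=1)
    dec_idx = torch.multinomial(probs, n_dec, replacement=False)
    dec_mask = torch.zeros(B, N, dtype=torch.bool).scatter_(1, dec_idx, True)
    return enc_mask, dec_mask
```

The second sketch pictures the pose-conditioned recalibration as a re-projection consistency term: per-pixel 3D points predicted by the geometry head are projected into the image plane with the estimated camera parameters, and the 2D semantic and RGB predictions sampled at the landing positions are aligned with the values those points carry. The bilinear sampling via `F.grid_sample`, the MSE alignment loss, and all tensor layouts are assumptions; only the re-project-and-align scheme comes from the text above.

```python
import torch
import torch.nn.functional as F

def reproject(points_3d, K, w2c):
    """Project per-pixel 3D points (B, H, W, 3) into the image plane.

    K: (B, 3, 3) estimated intrinsics; w2c: (B, 4, 4) world-to-camera.
    Returns a grid of normalized coordinates in [-1, 1] for F.grid_sample.
    """
    B, H, W, _ = points_3d.shape
    pts = points_3d.reshape(B, -1, 3)
    pts_h = torch.cat([pts, torch.ones_like(pts[..., :1])], dim=-1)  # (B, HW, 4)
    cam = (w2c @ pts_h.transpose(1, 2))[:, :3]                       # (B, 3, HW)
    uvz = K @ cam                                                    # project
    uv = uvz[:, :2] / uvz[:, 2:3].clamp(min=1e-6)                    # pixel coords
    u = uv[:, 0] / (W - 1) * 2 - 1                                   # normalize x
    v = uv[:, 1] / (H - 1) * 2 - 1                                   # normalize y
    return torch.stack([u, v], dim=-1).reshape(B, H, W, 2)

def recalibration_loss(points_3d, sem_point, rgb_point, sem_map, rgb_map, K, w2c):
    """Align per-point semantics/colors (stored as source-pixel maps) with
    the 2D semantic and RGB predictions sampled where the points land."""
    grid = reproject(points_3d, K, w2c)                              # (B, H, W, 2)
    sem_at = F.grid_sample(sem_map, grid, align_corners=True)        # (B, C, H, W)
    rgb_at = F.grid_sample(rgb_map, grid, align_corners=True)        # (B, 3, H, W)
    return F.mse_loss(sem_at, sem_point) + F.mse_loss(rgb_at, rgb_point)
```

In a full pipeline these terms would be weighted against the primary reconstruction and segmentation losses; the weighting is omitted here.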

Bo Zhou, Qiuxia Lai, Zeren Sun, Xiangbo Shu, Yazhou Yao, Wenguan Wang • 2026

Related benchmarks

Task                                     | Dataset             | Metric           | Result | Rank
Robot Manipulation                       | LIBERO Object       | Success Rate     | 78.4   | 70
Robotic Manipulation                     | Franka-Kitchen      | Avg Success Rate | 44.5   | 39
Novel View Synthesis                     | RE10K (Medium)      | PSNR             | 25.246 | 33
Novel View Synthesis                     | RE10K (Average)     | PSNR             | 25.397 | 33
Camera Pose Estimation                   | RealEstate10K       | --               | --     | 26
Visuomotor Control                       | LIBERO Goal         | Success Rate     | 67.3   | 22
3D Object Detection                      | EmbodiedScan        | AP@0.25          | 28.69  | 13
3D Open-vocabulary Semantic Segmentation | ScanNet Source View | mIoU             | 55.63  | 9
Embodied AI                              | VC-1 AD             | Success Rate     | 61.7   | 9
Embodied AI                              | VC-1 MW             | Success Rate     | 94.3   | 9

(Showing 10 of 29 rows)
