
Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images

About

Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. We introduce UniSplat, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a dual-masking strategy that strengthens geometry induction in the encoder: by masking both encoder and decoder tokens and targeting the decoder masks at geometry-rich regions, it forces the model to infer structural information from incomplete visual cues, yielding geometry-aware representations even from unposed inputs. Second, we develop a coarse-to-fine Gaussian splatting strategy that reduces appearance-semantics inconsistencies by progressively refining the radiance field. Finally, to enforce geometry-semantic consistency, we introduce a pose-conditioned recalibration mechanism that couples the outputs of multiple heads: predicted 3D point and semantic maps are re-projected into the image plane using estimated camera parameters and aligned with the corresponding RGB and semantic predictions, thereby resolving geometry-semantic mismatches across tasks. Together, these components yield unified 3D representations that are robust to unposed, sparse-view inputs and generalize across diverse tasks, laying a perceptual foundation for spatial intelligence.
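To make the first and third components more concrete, here are two hedged PyTorch sketches. The first illustrates the dual-masking idea: encoder tokens are masked uniformly at random, while decoder masks are biased toward geometry-rich tokens. Everything beyond that one-sentence description is an assumption for illustration; the function name `sample_masks`, the mask ratios, and the use of a precomputed per-token geometry score (e.g., local depth-gradient magnitude) as the targeting signal are not specified by the abstract.

```python
import torch

def sample_masks(tokens, geom_score, enc_ratio=0.5, dec_ratio=0.25):
    """tokens: (B, N, D) ViT-style patch tokens; geom_score: (B, N)
    per-token geometry-richness score (assumed precomputed).

    Encoder masks are uniform-random; decoder masks are biased toward
    geometry-rich tokens, so the model must infer structure from
    incomplete visual cues. Ratios are illustrative, not the paper's.
    """
    B, N, _ = tokens.shape
    n_enc, n_dec = int(N * enc_ratio), int(N * dec_ratio)
    # Encoder: mask a uniform-random subset of tokens (True = masked).
    enc_mask = torch.rand(B, N).argsort(dim=1) < n_enc
    # Decoder: sample tokens without replacement, weighted by geometry score.
    probs = torch.softmax(geom_score, dim=1)
    dec_idx = torch.multinomial(probs, n_dec, replacement=False)
    dec_mask = torch.zeros(B, N, dtype=torch.bool).scatter_(1, dec_idx, True)
    return enc_mask, dec_mask
```

The second sketch pictures the pose-conditioned recalibration as a re-projection consistency term: per-pixel 3D points predicted by the geometry head are projected into the image plane with the estimated camera parameters, and the 2D semantic and RGB predictions sampled at the landing positions are aligned with the values those points carry. The bilinear sampling via `F.grid_sample`, the MSE alignment loss, and all tensor layouts are assumptions; only the re-project-and-align scheme comes from the text above.

```python
import torch
import torch.nn.functional as F

def reproject(points_3d, K, w2c):
    """Project per-pixel 3D points (B, H, W, 3) into the image plane.

    K: (B, 3, 3) estimated intrinsics; w2c: (B, 4, 4) world-to-camera.
    Returns a grid of normalized coordinates in [-1, 1] for F.grid_sample.
    """
    B, H, W, _ = points_3d.shape
    pts = points_3d.reshape(B, -1, 3)
    pts_h = torch.cat([pts, torch.ones_like(pts[..., :1])], dim=-1)  # (B, HW, 4)
    cam = (w2c @ pts_h.transpose(1, 2))[:, :3]                       # (B, 3, HW)
    uvz = K @ cam                                                    # project
    uv = uvz[:, :2] / uvz[:, 2:3].clamp(min=1e-6)                    # pixel coords
    u = uv[:, 0] / (W - 1) * 2 - 1                                   # normalize x
    v = uv[:, 1] / (H - 1) * 2 - 1                                   # normalize y
    return torch.stack([u, v], dim=-1).reshape(B, H, W, 2)

def recalibration_loss(points_3d, sem_point, rgb_point, sem_map, rgb_map, K, w2c):
    """Align per-point semantics/colors (stored as source-pixel maps) with
    the 2D semantic and RGB predictions sampled where the points land."""
    grid = reproject(points_3d, K, w2c)                              # (B, H, W, 2)
    sem_at = F.grid_sample(sem_map, grid, align_corners=True)        # (B, C, H, W)
    rgb_at = F.grid_sample(rgb_map, grid, align_corners=True)        # (B, 3, H, W)
    return F.mse_loss(sem_at, sem_point) + F.mse_loss(rgb_at, rgb_point)
```

In a full pipeline these terms would be weighted against the primary reconstruction and segmentation losses; the weighting is omitted here.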

Bo Zhou, Qiuxia Lai, Zeren Sun, Xiangbo Shu, Yazhou Yao, Wenguan Wang • 2026

Related benchmarks

Task                                     | Dataset             | Metric           | Result | Rank
Robot Manipulation                       | LIBERO Object       | Success Rate     | 78.4   | 70
Robotic Manipulation                     | Franka-Kitchen      | Avg Success Rate | 44.5   | 39
Novel View Synthesis                     | RE10K (Medium)      | PSNR             | 25.246 | 33
Novel View Synthesis                     | RE10K (Average)     | PSNR             | 25.397 | 33
Camera Pose Estimation                   | RealEstate10K       | --               | --     | 26
Visuomotor Control                       | LIBERO Goal         | Success Rate     | 67.3   | 22
3D Object Detection                      | EmbodiedScan        | AP@0.25          | 28.69  | 13
3D Open-vocabulary Semantic Segmentation | ScanNet Source View | mIoU             | 55.63  | 9
Embodied AI                              | VC-1 AD             | Success Rate     | 61.7   | 9
Embodied AI                              | VC-1 MW             | Success Rate     | 94.3   | 9

(Showing 10 of 29 rows)
