Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles
About
Stylizing 3D scenes instantly while maintaining multi-view consistency and faithfully resembling a style image remains a significant challenge. Current state-of-the-art 3D stylization methods typically involve computationally intensive test-time optimization to transfer artistic features into a pretrained 3D representation, often requiring dense posed input images. In contrast, leveraging recent advances in feed-forward reconstruction models, we demonstrate a novel approach to achieve direct 3D stylization in less than a second using unposed sparse-view scene images and an arbitrary style image. To address the inherent decoupling between reconstruction and stylization, we introduce a branched architecture that separates structure modeling and appearance shading, effectively preventing stylistic transfer from distorting the underlying 3D scene structure. Furthermore, we adapt an identity loss to facilitate pre-training our stylization model through the novel view synthesis task. This strategy also allows our model to retain its original reconstruction capabilities while being fine-tuned for stylization. Comprehensive evaluations, using both in-domain and out-of-domain datasets, demonstrate that our approach produces high-quality stylized 3D content that achieves a superior blend of style and scene appearance, while also outperforming existing methods in terms of multi-view consistency and efficiency.
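The identity loss mentioned above can be illustrated with a toy sketch. The general idea, as commonly used in style transfer, is that when the content itself is supplied as the style, the output should reproduce the content. The code below is not the Styl3R model: `adain` is a hypothetical stand-in stylizer (AdaIN-like mean/std matching on a flat list of feature values), and `identity_loss` is an assumed mean-squared-error formulation chosen only for illustration.

```python
from statistics import fmean, pstdev


def adain(content, style, eps=1e-8):
    """Toy stand-in stylizer (an assumption, not the paper's network):
    shift and scale the content values to match the style's mean and std."""
    c_mean, c_std = fmean(content), pstdev(content)
    s_mean, s_std = fmean(style), pstdev(style)
    return [(x - c_mean) / (c_std + eps) * s_std + s_mean for x in content]


def identity_loss(stylize, content):
    """Mean squared error between the content and stylize(content, content).
    A stylizer that leaves content intact under its own style scores near 0."""
    out = stylize(content, content)
    return fmean([(o - c) ** 2 for o, c in zip(out, content)])


# Hypothetical feature values, for illustration only.
content = [0.1, 0.5, 0.9]
style = [2.0, 4.0, 6.0]

stylized = adain(content, style)          # takes on the style's statistics
loss = identity_loss(adain, content)      # close to zero for this stylizer
```

In this sketch, a low identity loss indicates that the stylizer does not alter content when no new style is introduced, which is the property that lets a stylization model keep behaving like a plain reconstruction model.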
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D Stylization | TnT Truck scene | ArtScore 7.07 | 15 |
| 3D Stylization | TnT (M60 scene) | ArtScore 8.43 | 15 |
| Multi-view consistency | Garden scene Long-range AnyStyle | LPIPS 0.185 | 11 |
| Multi-view consistency | Truck scene Short-range AnyStyle | LPIPS 0.049 | 11 |
| Multi-view consistency | Garden scene short-range AnyStyle | LPIPS 0.085 | 11 |
| Multi-view consistency | AnyStyle Scene Long-range (train) | LPIPS 0.109 | 11 |
| Short-range Multi-view Consistency | Tanks and Temples short-range | Average LPIPS 0.056 | 11 |
| Multi-view consistency | M60 scene AnyStyle (short-range) | LPIPS 0.064 | 11 |
| Multi-view consistency | Truck scene Long-range AnyStyle | LPIPS 0.136 | 11 |
| Multi-view consistency | M60 scene Long-range AnyStyle | LPIPS 0.16 | 11 |