Efficient 3D Content Reconstruction and Generation
About
Automatic 3D content creation seeks to replace labor-intensive modeling and scanning pipelines with systems that synthesize or recover 3D assets directly from text or images. Its applications span video games, virtual reality, robotics, and simulation, where it enables rapid asset prototyping, generation of diverse interactive worlds, and efficient 3D data collection for training foundation models. Contemporary solutions largely follow two complementary paradigms: (i) text- or image-to-3D generation, which learns priors over 3D geometry and appearance to create novel assets from natural language or a single-view image; and (ii) 3D reconstruction, which estimates camera poses and geometry from RGB images. This thesis advances both directions. On the generation side, I introduce Instant3D, which combines multi-view diffusion with feed-forward sparse-view 3D reconstruction to produce high-quality assets in 5-20 seconds. On the reconstruction side, I develop FastMap, a structure-from-motion pipeline that achieves up to a 10x speedup over the prior state of the art by relying extensively on first-order optimization with fused GPU kernels, while maintaining comparable pose accuracy and downstream novel view synthesis quality.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Single-image 3D Reconstruction | ABO dataset (test) | FID | 27.88 | 7 |
| Single-image 3D Reconstruction | GSO dataset (test) | FID | 30.01 | 7 |
| Multi-view consistency | DreamFusion 414 text prompts (test) | Avg MRC | 8.92 | 7 |
| Text-to-3D | Shap-E prompts (test) | R-Prec | 55.14 | 6 |
| Structure-from-Motion | Tanks and Temples (train) | ATE | 0.0032 | 4 |
| Structure-from-Motion | Tanks & Temples Advanced | ATE | 0.0068 | 4 |
| Structure-from-Motion | Tanks & Temples Intermediate | ATE | 9.20e-5 | 4 |
| Text-to-3D Generation | DreamFusion 400 text prompts (test) | CLIP Score (ViT-L/14) | 26.87 | 4 |
| Camera pose estimation | Mip-NeRF 360 (average across 9 scenes) | Time (sec) | 33 | 3 |
| Camera pose estimation | Tanks and Temples advanced (average over 6 scenes) | Time (sec) | 61 | 3 |