Chirpy3D: Part-Aware Multi-View Diffusion for Creative Fine-Grained Object Generation
About
Understanding and generating the fine-grained structure of objects -- such as birds with species-specific beaks, wings, and tails -- is a long-standing challenge in computer vision. We propose Chirpy3D, a part-aware multi-view diffusion framework that learns a hierarchical part latent space from unposed 2D images, using only off-the-shelf 2D part segmentation masks as spatial guidance -- without requiring any 3D data, camera poses, or manual part annotations. This latent space enables intuitive part-level swapping, interpolation, and zero-shot composition. A self-supervised feature consistency loss further encourages structural alignment across views, allowing coherent generation even with hybrid or unseen part combinations. Our core contributions are the controllable part-aware latent space and the multi-view diffusion model. Downstream 3D generation is supported via any differentiable renderer such as NeRF but is orthogonal to the main framework, making Chirpy3D a flexible foundation for creative object generation in the absence of structured 3D data. Code is released at https://github.com/kamwoh/chirpy3d.
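To make the part-level operations above concrete, here is a minimal sketch of swapping and interpolating part embeddings. It is not the Chirpy3D implementation: the flat dictionary layout, the names `swap_parts` and `interpolate_parts`, the part names, and the use of plain linear interpolation are all assumptions made for illustration, and the hierarchical latent space described in the abstract is likely richer than this.

```python
from typing import Dict, List

# Hypothetical layout: one embedding vector per named part.
PartLatents = Dict[str, List[float]]


def swap_parts(target: PartLatents, donor: PartLatents, parts: List[str]) -> PartLatents:
    """Return a copy of `target` whose listed parts are taken from `donor`."""
    out = {name: list(vec) for name, vec in target.items()}
    for name in parts:
        out[name] = list(donor[name])
    return out


def interpolate_parts(a: PartLatents, b: PartLatents, t: float) -> PartLatents:
    """Linearly blend every part embedding: (1 - t) * a + t * b (an assumed scheme)."""
    return {
        name: [(1.0 - t) * x + t * y for x, y in zip(a[name], b[name])]
        for name in a
    }


# Toy 2-D embeddings for two species; the values are made up for illustration.
sparrow = {"beak": [0.0, 1.0], "wing": [2.0, 2.0], "tail": [4.0, 0.0]}
heron = {"beak": [1.0, 0.0], "wing": [0.0, 4.0], "tail": [2.0, 2.0]}

hybrid = swap_parts(sparrow, heron, ["beak"])      # sparrow body with a heron beak
midpoint = interpolate_parts(sparrow, heron, 0.5)  # halfway between the two species
```

In a real pipeline the resulting part embeddings would condition the multi-view diffusion model; that step is omitted here because its interface is not described in this summary.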
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-view Generation | CUB-200-2011 (train) | DINO Score | 0.38 | 3 |
| Part Disentanglement | Multi-view images (1,000 samples) | IoU | 95.7 | 3 |
| Part Composition | CUB-200-2011 (test) | EMR | 29.5 | 3 |
| Novel Species Generation | Novel Species Generation | Entropy (H) | 4.81 | 2 |
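
For readers unfamiliar with the IoU metric in the table, the sketch below computes the standard intersection-over-union between two binary masks. How the benchmark obtains its masks and aggregates scores over samples is not specified here, so the function name `mask_iou`, the nested-list mask format, and the convention for two empty masks are assumptions.

```python
from typing import Sequence


def mask_iou(pred: Sequence[Sequence[int]], ref: Sequence[Sequence[int]]) -> float:
    """Intersection-over-union of two same-sized binary masks (nested 0/1 lists)."""
    inter = 0
    union = 0
    for row_p, row_r in zip(pred, ref):
        for p, r in zip(row_p, row_r):
            inter += 1 if (p and r) else 0
            union += 1 if (p or r) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Toy 2x3 masks: 2 pixels overlap out of 4 pixels covered by either mask.
pred_mask = [[1, 1, 0], [0, 1, 0]]
ref_mask = [[1, 0, 0], [0, 1, 1]]
iou = mask_iou(pred_mask, ref_mask)  # 0.5
```

A score such as 95.7 in the table would correspond to this ratio expressed as a percentage, presumably averaged over parts and samples.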