CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation
About
Generating shapes using natural language can enable new ways of imagining and creating the things around us. While significant recent progress has been made in text-to-image generation, text-to-shape generation remains a challenging problem due to the unavailability of paired text and shape data at a large scale. We present a simple yet effective method for zero-shot text-to-shape generation that circumvents such data scarcity. Our proposed method, named CLIP-Forge, is based on a two-stage training process, which only depends on an unlabelled shape dataset and a pre-trained image-text network such as CLIP. Our method has the benefits of avoiding expensive inference time optimization, as well as the ability to generate multiple shapes for a given text. We not only demonstrate promising zero-shot generalization of the CLIP-Forge model qualitatively and quantitatively, but also provide extensive comparative evaluations to better understand its behavior.
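The abstract describes a two-stage pipeline that conditions shape generation on CLIP features, so that at inference a text embedding can stand in for an image embedding. Below is a minimal sketch of that zero-shot inference step, not the authors' code: it assumes stage 1 trained a shape autoencoder and stage 2 trained a conditional normalizing flow over its latents, and the `flow` and `decoder` arguments are hypothetical stand-ins for those trained components. Only the `clip` calls (`clip.load`, `clip.tokenize`, `encode_text`) are the real OpenAI CLIP API.

```python
# Minimal sketch of zero-shot text-to-shape inference in a CLIP-Forge-style
# pipeline. `flow` and `decoder` are HYPOTHETICAL trained modules (a
# conditional normalizing flow over shape latents, and a shape decoder);
# only the CLIP calls are the real API.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)


@torch.no_grad()
def text_to_shapes(prompt: str, flow, decoder, num_shapes: int = 5):
    """Generate several shapes for one prompt.

    Zero-shot trick: the flow is trained to map noise to shape latents
    conditioned on CLIP *image* features, but CLIP's shared embedding
    space lets us condition on the *text* feature at inference time.
    Sampling different noise vectors yields multiple shapes per prompt.
    """
    tokens = clip.tokenize([prompt]).to(device)
    cond = clip_model.encode_text(tokens).float()        # (1, feature_dim)
    cond = cond / cond.norm(dim=-1, keepdim=True)        # unit-normalize, as CLIP features usually are
    cond = cond.expand(num_shapes, -1)                   # same condition, many samples

    noise = torch.randn(num_shapes, flow.latent_dim, device=device)
    latents = flow.inverse(noise, cond)                  # noise -> shape latents (hypothetical signature)
    return decoder(latents)                              # latents -> voxels / occupancies
```

Because generation is a single flow inversion plus a decoder pass, there is no per-prompt optimization loop, which is the "avoiding expensive inference time optimization" benefit claimed above.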
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Shape Generation | ShapeNet13 | FID | 2.10e+3 | 9 |
| Text-conditioned 3D Generation | ShapeNetCore13 (test) | Accuracy | 83.33 | 4 |
| Text-conditioned 3D shape generation | Text2Shape (original) | CLIP-S | 26.34 | 4 |
| 3D Generation | ModelNet40 Chair (test) | FPD | 826 | 3 |
| 3D Generation | ModelNet40 Table (test) | FPD | 3.05e+3 | 3 |
| Text-to-Shape Generation | ShapeNet (test) | FID | 112.4 | 2 |