
CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation

About

Generating shapes using natural language can enable new ways of imagining and creating the things around us. While significant recent progress has been made in text-to-image generation, text-to-shape generation remains a challenging problem due to the unavailability of paired text and shape data at a large scale. We present a simple yet effective method for zero-shot text-to-shape generation that circumvents such data scarcity. Our proposed method, named CLIP-Forge, is based on a two-stage training process, which only depends on an unlabelled shape dataset and a pre-trained image-text network such as CLIP. Our method has the benefits of avoiding expensive inference time optimization, as well as the ability to generate multiple shapes for a given text. We not only demonstrate promising zero-shot generalization of the CLIP-Forge model qualitatively and quantitatively, but also provide extensive comparative evaluations to better understand its behavior.
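The two-stage idea described above can be sketched in code: stage one trains a shape autoencoder on unlabelled shapes, stage two trains a conditional flow that maps noise to shape embeddings given CLIP features of rendered shapes, and at inference the flow is conditioned on CLIP *text* features instead, enabling zero-shot generation. The sketch below is a minimal illustration only; the dimensions, the linear encoder/decoder, and the single affine conditioning step are placeholder assumptions, not the paper's actual architecture or the real CLIP model.

```python
import numpy as np

rng = np.random.default_rng(0)

D_SHAPE = 32   # shape-embedding size (assumed for illustration)
D_CLIP = 16    # CLIP feature size (assumed; real CLIP uses 512)

# --- Stage 1: shape autoencoder (placeholder: random linear maps) ---
W_enc = rng.normal(size=(64, D_SHAPE)) / 8.0   # flattened voxel grid -> embedding
W_dec = rng.normal(size=(D_SHAPE, 64)) / 8.0   # embedding -> flattened voxel grid

def encode_shape(voxels):          # voxels: (64,)
    return voxels @ W_enc

def decode_shape(z):               # z: (D_SHAPE,)
    return z @ W_dec

# --- Stage 2: conditional affine step from noise to shape embeddings ---
# A stand-in for the conditional normalizing flow: z = mu(c) + sigma(c) * eps,
# where c is a CLIP feature vector.
W_mu = rng.normal(size=(D_CLIP, D_SHAPE)) / 4.0
W_sig = rng.normal(size=(D_CLIP, D_SHAPE)) / 4.0

def flow_sample(c, eps):
    mu = c @ W_mu
    sigma = np.exp(np.clip(c @ W_sig, -3.0, 3.0))   # keep scales positive, bounded
    return mu + sigma * eps

# Training fits the flow on CLIP *image* features of rendered shapes; at
# inference we condition on CLIP *text* features instead (the zero-shot swap).
# Sampling several eps vectors yields multiple shapes for one text prompt.
def text_to_shape(text_feature, n_samples=3):
    return [decode_shape(flow_sample(text_feature, rng.normal(size=D_SHAPE)))
            for _ in range(n_samples)]

c_text = rng.normal(size=D_CLIP)   # stand-in for a CLIP text embedding
shapes = text_to_shape(c_text, n_samples=3)
print(len(shapes), shapes[0].shape)
```

Because sampling draws fresh noise per shape, the same text feature produces several distinct outputs without any inference-time optimization, mirroring the two benefits the abstract highlights.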

Aditya Sanghi, Hang Chu, Joseph G. Lambourne, Ye Wang, Chin-Yi Cheng, Marco Fumero, Kamal Rahimi Malekshan • 2021

Related benchmarks

Task                                   Dataset                   Metric     Result     Rank
Text-to-Shape Generation               ShapeNet13                FID        2.10e+3    9
Text-conditioned 3D Generation         ShapeNetCore13 (test)     Accuracy   83.33      4
Text-conditioned 3D Shape Generation   Text2Shape (original)     CLIP-S     26.34      4
3D Generation                          ModelNet40 Chair (test)   FPD        826        3
3D Generation                          ModelNet40 Table (test)   FPD        3.05e+3    3
Text-to-Shape Generation               ShapeNet (test)           FID        112.4      2
