Scalable 3D Captioning with Pretrained Models

About

We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects. The approach uses pretrained models for image captioning, image-text alignment, and large language modeling (LLMs) to consolidate captions from multiple views of a 3D asset, completely sidestepping the time-consuming and costly process of manual annotation. We apply Cap3D to the recently introduced large-scale 3D dataset Objaverse, resulting in 660k 3D-text pairs. Our evaluation, conducted using 41k human annotations from the same dataset, demonstrates that Cap3D surpasses human-authored descriptions in terms of quality, cost, and speed. Through effective prompt engineering, Cap3D rivals human performance in generating geometric descriptions on 17k collected annotations from the ABO dataset. Finally, we finetune text-to-3D models on Cap3D and human captions and show that Cap3D captions outperform human captions; we also benchmark state-of-the-art (SOTA) text-to-3D methods, including Point-E, Shap-E, and DreamFusion.
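
The multi-view consolidation described above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `caption_3d_asset`, `propose_captions`, `alignment_score`, and `consolidate` are hypothetical, and keeping one best-aligned caption per view before consolidation is an assumed way of combining the three kinds of pretrained model the abstract names. Toy stand-ins replace the real models so the sketch runs without any dependencies.

```python
from typing import Callable, List, Sequence, TypeVar

View = TypeVar("View")  # whatever object represents one rendered view


def caption_3d_asset(
    views: Sequence[View],
    propose_captions: Callable[[View], List[str]],   # stands in for an image captioner
    alignment_score: Callable[[View, str], float],   # stands in for an image-text alignment model
    consolidate: Callable[[List[str]], str],         # stands in for an LLM call
) -> str:
    """Return one description for a 3D asset from its rendered views."""
    if not views:
        raise ValueError("need at least one rendered view")

    best_per_view: List[str] = []
    for view in views:
        candidates = propose_captions(view)
        if not candidates:
            continue  # skip views the captioner could not describe
        # Assumption: keep the candidate that best matches this view.
        best = max(candidates, key=lambda c: alignment_score(view, c))
        best_per_view.append(best)

    if not best_per_view:
        raise ValueError("captioner returned no candidates")
    # Merge the per-view captions into a single description.
    return consolidate(best_per_view)


# Toy stand-ins (hypothetical): each "view" is a string holding its own ground truth.
def toy_captioner(view: str) -> List[str]:
    return ["a chair", view.split(": ")[1]]


def toy_score(view: str, caption: str) -> float:
    # Word overlap as a crude proxy for image-text similarity.
    return float(len(set(view.split()) & set(caption.split())))


def toy_llm(captions: List[str]) -> str:
    # Pick the most detailed caption instead of calling a real LLM.
    return max(captions, key=len)


demo_views = ["front: red chair", "side: red wooden chair"]
print(caption_3d_asset(demo_views, toy_captioner, toy_score, toy_llm))
```

Passing the three models in as callables keeps the sketch independent of any particular captioner, alignment model, or LLM; a real pipeline would substitute model calls for the toy functions.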

Tiange Luo, Chris Rockwell, Honglak Lee, Justin Johnson • 2023

Related benchmarks

Task | Dataset | Result | Rank
Text-to-3D | Objaverse 1.0 (test) | CLIP Score 80.4 | 11
3D Object Captioning | Objaverse | -- | 10
3D Object Captioning | Objaverse-LVIS 1k sampled | CLIPScore 78.6 | 9
3D Object Captioning | ABO 6.4k objects | CLIPScore 74.8 | 9
3D Object Captioning | Objaverse-XL (5k sampled) | CLIPScore 76.4 | 8
3D Object Description | ShapeNet-Core | CLIP Score 76.5 | 8
3D Object Description | ScanNet | CLIP Score 73.2 | 8
3D Object Description | ModelNet40 | CLIP Score 74.3 | 8
Geometry Captioning | ABO Fine-Grained Geometry Captions (test) | Win % 88.21 | 4
Text-to-3D | Objaverse 1.0 (test) | -- | 4
Showing 10 of 11 rows
