Scalable 3D Captioning with Pretrained Models

About

We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects. This approach uses pretrained models for image captioning, image-text alignment, and large language modeling (LLMs) to consolidate captions from multiple views of a 3D asset, completely sidestepping the time-consuming and costly process of manual annotation. We apply Cap3D to the recently introduced large-scale 3D dataset Objaverse, resulting in 660k 3D-text pairs. Our evaluation, conducted using 41k human annotations from the same dataset, demonstrates that Cap3D surpasses human-authored descriptions in terms of quality, cost, and speed. Through effective prompt engineering, Cap3D rivals human performance in generating geometric descriptions on 17k collected annotations from the ABO dataset. Finally, we finetune text-to-3D models on Cap3D and human captions and show that Cap3D outperforms the human captions; we also benchmark state-of-the-art methods, including Point-E, Shap-E, and DreamFusion.
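The multi-view consolidation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `consolidate_captions` and its parameters `propose`, `align`, and `summarize` are hypothetical stand-ins for the image captioning model, the image-text alignment model, and the LLM, and the ordering (several candidate captions per view, keep the best-aligned one, then merge) is one plausible reading of the abstract.

```python
from typing import Callable, List, Sequence


def consolidate_captions(
    views: Sequence[str],
    propose: Callable[[str], List[str]],
    align: Callable[[str, str], float],
    summarize: Callable[[List[str]], str],
) -> str:
    """Keep the best-aligned caption for each view, then merge them.

    All three callables are hypothetical placeholders:
      propose(view)        -> candidate captions for one rendered view
      align(view, caption) -> image-text alignment score (higher is better)
      summarize(captions)  -> a single description of the whole object
    """
    best_per_view = []
    for view in views:
        candidates = propose(view)
        # Select the candidate that the alignment model scores highest.
        best = max(candidates, key=lambda caption: align(view, caption))
        best_per_view.append(best)
    return summarize(best_per_view)


# Toy stand-ins so the sketch runs without any pretrained model.
def toy_propose(view: str) -> List[str]:
    return [f"a chair seen from the {view}", "an object"]


def toy_align(view: str, caption: str) -> float:
    # Pretend a caption that mentions the view name is better aligned.
    return float(view in caption)


def toy_summarize(captions: List[str]) -> str:
    return "; ".join(captions)


result = consolidate_captions(
    ["front", "side"], toy_propose, toy_align, toy_summarize
)
print(result)
```

Passing the models in as callables keeps the sketch independent of any particular captioning, alignment, or LLM library; a real pipeline would replace the toy functions with calls to actual pretrained models and would operate on rendered images rather than view names.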

Tiange Luo, Chris Rockwell, Honglak Lee, Justin Johnson • 2023

Related benchmarks

Task	Dataset	Result
3D Object Captioning	Objaverse	--	38
Text-to-3D	Objaverse 1.0 (test)	CLIP Score 80.4	11
3D Object Captioning	Cap3D subset of 3186 3D objects (test)	CLIP Image-Text Score 0.287	10
3D Object Captioning	Objaverse-LVIS 1k sampled	CLIPScore 78.6	9
3D Object Captioning	ABO 6.4k objects	CLIPScore 74.8	9
3D Object Captioning	Objaverse-XL (5k sampled)	CLIPScore 76.4	8
3D Object Description	ShapeNet-Core	CLIP Score 76.5	8
3D Object Description	ScanNet	CLIP Score 73.2	8
3D Object Description	ModelNet40	CLIP Score 74.3	8
Geometry Captioning	ABO Fine-Grained Geometry Captions (test)	Win % 88.21	4
