
Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

About

This work proposes POMP, a prompt pre-training method for vision-language models. Being memory- and computation-efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts spanning over twenty thousand classes. Once pre-trained, the prompt, with its strong transferability, can be directly plugged into a variety of visual recognition tasks, including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg). Our code is available at https://github.com/amazon-science/prompt-pretraining.
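
To make the "plug the pre-trained prompt into a zero-shot recognizer" idea concrete, here is a minimal PyTorch sketch of the general CLIP-style pattern: a shared learned prompt is prepended to each class name's token embeddings, the frozen text encoder produces one embedding per class, and the image is scored by cosine similarity. The function name, tensor shapes, and the `text_encoder` interface are illustrative assumptions for this sketch, not the repository's actual API.

```python
import torch
import torch.nn.functional as F

def zero_shot_logits(image_feat, class_token_embs, prompt, text_encoder):
    """Score an image against every class name (hypothetical interface).

    image_feat:       (D,)      image embedding from a frozen vision encoder
    class_token_embs: (C, L, D) token embeddings of the C class names
    prompt:           (P, D)    learned prompt vectors, shared by all classes
    text_encoder:     frozen text encoder mapping (C, P+L, D) -> (C, D)
    """
    num_classes = class_token_embs.shape[0]
    # Prepend the same pre-trained prompt to every class-name embedding.
    prompt_batch = prompt.unsqueeze(0).expand(num_classes, -1, -1)
    text_input = torch.cat([prompt_batch, class_token_embs], dim=1)
    text_feats = text_encoder(text_input)               # (C, D)
    # Cosine similarity between the image and each prompted class.
    image_feat = F.normalize(image_feat, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return image_feat @ text_feats.t()                  # (C,) class scores
```

Because the prompt is learned once over a large vocabulary rather than per downstream dataset, the same `prompt` tensor can be reused across classification, segmentation, and detection heads; only the class-name list changes per task.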

Shuhuai Ren, Aston Zhang, Yi Zhu, Shuai Zhang, Shuai Zheng, Mu Li, Alex Smola, Xu Sun • 2023

Related benchmarks

Task                  | Dataset       | Metric    | Result | Rank
Image Classification  | ImageNet-1K   | Top-1 Acc | 70.16  | 524
Image Classification  | EuroSAT       | --        | --     | 497
Image Classification  | Food-101      | --        | --     | 494
Image Classification  | DTD           | --        | --     | 487
Image Classification  | SUN397        | --        | --     | 425
Image Classification  | UCF101        | Top-1 Acc | 68.44  | 404
Image Classification  | StanfordCars  | Accuracy  | 66.7   | 266
Image Classification  | CUB           | Accuracy  | 56.92  | 249
Image Classification  | FGVCAircraft  | Accuracy  | 25.47  | 225
Semantic segmentation | COCO Stuff    | mIoU      | 39.1   | 195
(Showing 10 of 30 benchmark entries.)

Other info

Code: https://github.com/amazon-science/prompt-pretraining
