Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition
About
This work proposes POMP, a prompt pre-training method for vision-language models. Being memory- and computation-efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts spanning over twenty thousand classes. Once pre-trained, the highly transferable prompt can be directly plugged into a variety of visual recognition tasks, including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg). Our code is available at https://github.com/amazon-science/prompt-pretraining.
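To illustrate the zero-shot plug-in idea, here is a minimal sketch of prompt-conditioned zero-shot classification in the CLIP style: a shared pre-trained prompt is prepended to every class name, each prompted class name is encoded into a text embedding, and the predicted class is the one whose embedding is most similar to the image embedding. The function name `zero_shot_classify` and the toy embeddings are illustrative assumptions, not the POMP implementation; the real encoders are stood in for by pre-computed vectors.

```python
import numpy as np

def zero_shot_classify(image_emb, class_embs):
    """Return the index of the class whose prompt-conditioned text
    embedding has the highest cosine similarity with the image embedding.

    image_emb:  (d,) image feature from the vision encoder.
    class_embs: (num_classes, d) text features, one per prompted class name.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_embs = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    scores = class_embs @ image_emb  # cosine similarity per class
    return int(np.argmax(scores))

# Toy example (illustrative only): 3 classes, 4-dim embeddings.
rng = np.random.default_rng(0)
class_embs = rng.normal(size=(3, 4))
image_emb = class_embs[1] + 0.1 * rng.normal(size=4)  # image close to class 1
print(zero_shot_classify(image_emb, class_embs))
```

Because the prompt is shared across all classes, extending the vocabulary at test time only requires encoding the new class names with the same prompt; no retraining is needed.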
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Acc | 70.16 | 524 |
| Image Classification | EuroSAT | -- | -- | 497 |
| Image Classification | Food-101 | -- | -- | 494 |
| Image Classification | DTD | -- | -- | 487 |
| Image Classification | SUN397 | -- | -- | 425 |
| Image Classification | UCF101 | Top-1 Acc | 68.44 | 404 |
| Image Classification | StanfordCars | Accuracy | 66.7 | 266 |
| Image Classification | CUB | Accuracy | 56.92 | 249 |
| Image Classification | FGVCAircraft | Accuracy | 25.47 | 225 |
| Semantic Segmentation | COCO Stuff | mIoU | 39.1 | 195 |