
Waffling around for Performance: Visual Classification with Random Words and Broad Concepts

About

The visual classification performance of vision-language models such as CLIP has been shown to benefit from additional semantic knowledge from large language models (LLMs) such as GPT-3. In particular, averaging over LLM-generated class descriptors, e.g. "waffle, which has a round shape", can notably improve generalization performance. In this work, we critically study this behavior and propose WaffleCLIP, a framework for zero-shot visual classification that simply replaces LLM-generated descriptors with random character and word descriptors. Without querying external models, we achieve comparable performance gains on a large number of visual classification tasks. This allows WaffleCLIP to serve both as a low-cost alternative and as a sanity check for future LLM-based vision-language model extensions. We conduct an extensive experimental study on the impact and shortcomings of the additional semantics introduced by LLM-generated descriptors, and showcase how, when available, semantic context is better leveraged by querying LLMs for high-level concepts, which also jointly resolves potential class-name ambiguities. Code is available at https://github.com/ExplainableML/WaffleCLIP.
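To make the core idea concrete, the following is a minimal illustrative sketch (not the authors' exact implementation) of how WaffleCLIP-style prompts could be built: random character descriptors are generated and appended to the class name, and in the full pipeline the CLIP text embeddings of these prompts would be averaged per class. The prompt template and parameter names here are assumptions for illustration only.

```python
import random
import string

def make_waffle_descriptors(num_descriptors=8, num_chars=5, seed=0):
    """Generate random character-sequence descriptors.

    Sketch of the WaffleCLIP idea: descriptors carry no semantic
    content, yet averaging prompts over them can still help.
    """
    rng = random.Random(seed)
    return [
        "".join(rng.choices(string.ascii_lowercase, k=num_chars))
        for _ in range(num_descriptors)
    ]

def build_prompts(class_name, descriptors):
    # Hypothetical prompt template mirroring descriptor-style prompts,
    # e.g. "A photo of a waffle, which has xqzrt."
    # In CLIP, the text embeddings of these prompts would be averaged
    # to form the classifier weight for `class_name`.
    return [f"A photo of a {class_name}, which has {d}." for d in descriptors]

descriptors = make_waffle_descriptors()
prompts = build_prompts("waffle", descriptors)
```

Because the descriptors are random, no external LLM query is needed; only the number and length of descriptors are tunable.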

Karsten Roth, Jae Myung Kim, A. Sophia Koepke, Oriol Vinyals, Cordelia Schmid, Zeynep Akata • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Acc | 68.81 | 524 |
| Image Classification | EuroSAT | -- | -- | 497 |
| Image Classification | Food-101 | -- | -- | 494 |
| Image Classification | DTD | Accuracy | 40.05 | 487 |
| Image Classification | ImageNet | -- | -- | 429 |
| Image Classification | SUN397 | -- | -- | 425 |
| Image Classification | UCF101 | Top-1 Acc | 67.19 | 404 |
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy | 76.1 | 359 |
| Image Classification | ImageNet (test) | Top-1 Accuracy | 75.31 | 291 |
| Image Classification | StanfordCars | Accuracy | 63.57 | 266 |
Showing 10 of 37 rows
