
V2C-CBM: Building Concept Bottlenecks with Vision-to-Concept Tokenizer

About

Concept Bottleneck Models (CBMs) offer inherent interpretability by first translating images into human-comprehensible concepts and then classifying with a linear combination of these concepts. However, annotating concepts for visual recognition tasks requires extensive expert knowledge and labor, which constrains the broad adoption of CBMs. Recent approaches have leveraged the knowledge of large language models to construct concept bottlenecks, with multimodal models like CLIP subsequently mapping image features into the concept feature space for classification. Despite this, the concepts produced by language models can be verbose and may introduce non-visual attributes, which hurts accuracy and interpretability. In this study, we investigate how to avoid these issues by constructing CBMs directly from multimodal models. To this end, we adopt common words as the base concept vocabulary and leverage auxiliary unlabeled images to construct a Vision-to-Concept (V2C) tokenizer that explicitly quantizes images into their most relevant visual concepts, thus creating a vision-oriented concept bottleneck tightly coupled with the multimodal model. The result is V2C-CBM, which is training-efficient and interpretable while achieving high accuracy. V2C-CBM matches or outperforms LLM-supervised CBMs on various visual classification benchmarks, validating the efficacy of our approach.
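The two stages described above, scoring an image against a concept vocabulary and then classifying with a linear layer over those concept scores, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' V2C tokenizer: the concept words, the 4-dimensional embeddings, the class weights, and the helper names (concept_scores, quantize_top_k, classify, explain) are all hypothetical. Cosine similarity stands in for the multimodal model's image-text similarity, and "quantization" is approximated by keeping only the top-k concepts. No training and no auxiliary unlabeled images are involved.

```python
import math

# Hypothetical toy "embeddings": in a real CBM these would come from a
# multimodal encoder such as CLIP; here they are hand-picked 4-d vectors.
CONCEPTS = ["red", "petals", "feathers", "beak"]
CONCEPT_EMBS = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
CLASSES = ["flower", "bird"]
# Hypothetical linear head: one weight per (class, concept) pair.
WEIGHTS = [
    [1.0, 1.0, 0.0, 0.0],  # flower
    [0.0, 0.0, 1.0, 1.0],  # bird
]
BIASES = [0.0, 0.0]


def normalize(v):
    """Scale v to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def concept_scores(image_emb, concept_embs):
    """Cosine similarity between the image and each concept word."""
    img = normalize(image_emb)
    return [sum(a * b for a, b in zip(img, normalize(c))) for c in concept_embs]


def quantize_top_k(scores, k):
    """Keep the k highest-scoring concepts and zero out the rest (assumed stand-in for V2C quantization)."""
    keep = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    return [s if i in keep else 0.0 for i, s in enumerate(scores)]


def classify(bottleneck, weights, biases):
    """Linear combination of concept activations gives one logit per class."""
    logits = [sum(w * c for w, c in zip(row, bottleneck)) + b
              for row, b in zip(weights, biases)]
    best = max(range(len(logits)), key=lambda j: logits[j])
    return best, logits


def explain(bottleneck, weights, cls):
    """Per-concept contribution (weight * activation) to one class logit."""
    return {name: w * c for name, w, c in zip(CONCEPTS, weights[cls], bottleneck)
            if w * c != 0.0}


image_emb = [2.0, 3.0, 1.0, 0.0]  # hypothetical image embedding
scores = concept_scores(image_emb, CONCEPT_EMBS)
bottleneck = quantize_top_k(scores, k=2)
pred, logits = classify(bottleneck, WEIGHTS, BIASES)
print(CLASSES[pred], explain(bottleneck, WEIGHTS, pred))
```

In this toy run the image is assigned to "flower", and the explanation lists only "red" and "petals". Because each class score is a plain sum of weight × activation terms, explain() reads off exactly which concepts drove the prediction, which is the interpretability property CBMs rely on.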

Hangzhou He, Lei Zhu, Xinliang Zhang, Shuang Zeng, Qian Chen, Yanye Lu• 2025

Related benchmarks

Task | Dataset | Metric | Result (%) | Rank
Image Classification | Flowers102 | Accuracy | 96.6 | 558
Image Classification | Food-101 | Accuracy | 92.2 | 542
Image Classification | Food101 | Accuracy | 81.2 | 457
Image Classification | CUB-200 2011 | Accuracy | 80.8 | 356
Image Classification | RESISC45 | -- | -- | 349
Image Classification | ImageNet (test) | Top-1 Accuracy | 84.15 | 299
Image Classification | DTD (test) | Accuracy | 78.49 | 257
Image Classification | Oxford Flowers 102 | -- | -- | 234
Image Classification | CIFAR100 (test) | Accuracy | 86.41 | 206
Image Classification | Food (test) | Accuracy | 92.84 | 124

(10 of 28 benchmark rows shown.)
