CFM: Language-aligned Concept Foundation Model for Vision
About
Language-aligned vision foundation models perform strongly across diverse downstream tasks, yet their learned representations remain opaque, making their decision-making difficult to interpret. Recent work decomposes these representations into human-interpretable concepts, but provides poor spatial grounding and is limited to image classification. In this work, we propose CFM, a language-aligned concept foundation model for vision that provides fine-grained concepts which are both human-interpretable and spatially grounded in the input image. When paired with a foundation model with strong semantic representations, CFM yields explanations for any of its downstream tasks. Examining local co-occurrence dependencies between concepts lets us define concept relationships, through which we improve concept naming and obtain richer explanations. On benchmark data, we show that CFM delivers classification, segmentation, and captioning performance competitive with opaque foundation models while providing fine-grained, high-quality concept-based explanations. Code at https://github.com/kawi19/CFM.
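The "local co-occurrence dependencies" idea can be illustrated with a minimal sketch: given spatial concept-activation maps, count how often pairs of concepts fire at the same location and score their relationship with a PMI-style statistic. This is a hypothetical illustration under assumed shapes (`num_concepts, H, W` activation maps and a fixed threshold), not the actual CFM implementation.

```python
import numpy as np

def concept_cooccurrence(activations, threshold=0.5):
    """Estimate local co-occurrence of concepts from spatial activation maps.

    activations: array of shape (num_concepts, H, W) with per-location
    concept scores (assumed layout; CFM's actual representation may differ).
    Returns a (num_concepts, num_concepts) matrix where entry (i, j) is the
    fraction of locations at which concepts i and j are both active.
    """
    present = (activations > threshold).reshape(activations.shape[0], -1)
    n_locations = present.shape[1]
    # Joint probability P(i and j both active at the same location).
    return (present.astype(float) @ present.T.astype(float)) / n_locations

def relationship_strength(joint, eps=1e-8):
    """PMI-style relationship score: positive for concepts that co-occur
    more often than independence would predict, negative otherwise."""
    marg = np.diag(joint)  # P(concept i active)
    return np.log((joint + eps) / (np.outer(marg, marg) + eps))

# Toy example: 3 concepts on a 4x4 grid; concept 1's region is nested
# inside concept 0's, concept 2 is disjoint from both.
acts = np.zeros((3, 4, 4))
acts[0, :2, :] = 1.0   # concept 0: top half
acts[1, :2, :2] = 1.0  # concept 1: top-left corner
acts[2, 2:, :] = 1.0   # concept 2: bottom half
joint = concept_cooccurrence(acts)
rel = relationship_strength(joint)
```

On this toy input, `rel[0, 1]` is positive (the concepts co-occur more than chance) while `rel[0, 2]` is strongly negative, which is the kind of signal one could use to group or rename related concepts.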
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet (test) | Top-1 Accuracy | 78.9 | 299 |
| Open Vocabulary Semantic Segmentation | Pascal VOC 20 | mIoU | 80.7 | 104 |
| Open-Vocabulary Segmentation | Cityscapes | mIoU | 31.5 | 49 |
| Open-Vocabulary Segmentation | COCO Object | mIoU | 33.3 | 34 |
| Open Vocabulary Semantic Segmentation | COCO Stuff | mIoU | 24.0 | 34 |
| Open-Vocabulary Segmentation | Pascal Context | mIoU | 33.2 | 20 |
| Open-Vocabulary Segmentation | ADE20K | mIoU | 20.4 | 18 |
| Open Vocabulary Semantic Segmentation | Pascal VOC | mIoU | 61.6 | 14 |
| Open Vocabulary Semantic Segmentation | Pascal Context 59 | mIoU | 36.5 | 10 |
| Image Classification | Places365 (test) | Accuracy | 55.4 | 9 |