Dataset Summarization by K Principal Concepts
About
We propose the new task of K principal concept identification for dataset summarization. The objective is to find a set of K concepts that best explain the variation within the dataset. Concepts are high-level, human-interpretable terms such as "tiger", "kayaking" or "happy". The K concepts are selected from a (potentially long) input list of candidates, which we call the concept-bank. The concept-bank may be taken from a generic dictionary or constructed from task-specific prior knowledge. An image-language embedding method (e.g. CLIP) is used to map the images and the concept-bank into a shared feature space. To select the K concepts that best explain the data, we formulate our problem as a K-uncapacitated facility location problem. An efficient optimization technique is used to scale the local search algorithm to very large concept-banks. The output of our method is a set of K principal concepts that summarize the dataset. Our approach provides a more explicit summary than selecting K representative images, which are often ambiguous. As a further application of our method, the K principal concepts can be used to classify the dataset into K groups. Extensive experiments demonstrate the efficacy of our approach.
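The selection step described above can be illustrated with a small sketch. It assumes the image and concept embeddings have already been computed (for example with CLIP), uses cosine distance as the assignment cost, and runs a plain greedy initialization followed by swap-based local search. The function names `assignment_cost` and `select_k_concepts`, the choice of cosine distance, and the greedy start are illustrative assumptions, not the paper's implementation; in particular, the efficient optimization that scales to very large concept-banks is not reproduced here.

```python
import numpy as np

def assignment_cost(dist, selected):
    """Sum over images of the distance to the nearest selected concept."""
    return dist[:, selected].min(axis=1).sum()

def select_k_concepts(image_emb, concept_emb, k, max_iters=100):
    """Pick k concept indices with greedy init + swap local search (a sketch).

    image_emb:   (n_images, d) array of image embeddings
    concept_emb: (n_concepts, d) array of concept-bank embeddings
    Returns the selected concept indices and, for each image, the position
    (0..k-1) of its nearest selected concept, i.e. its group.
    """
    # Assumption: cosine distance in the shared embedding space.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    concept_emb = concept_emb / np.linalg.norm(concept_emb, axis=1, keepdims=True)
    dist = 1.0 - image_emb @ concept_emb.T  # shape (n_images, n_concepts)
    n_concepts = dist.shape[1]

    # Greedy initialization: repeatedly add the concept that lowers the cost most.
    selected = []
    for _ in range(k):
        best_c, best_cost = None, np.inf
        for c in range(n_concepts):
            if c in selected:
                continue
            cost = assignment_cost(dist, selected + [c])
            if cost < best_cost:
                best_c, best_cost = c, cost
        selected.append(best_c)

    # Swap local search: replace one selected concept with an unselected one
    # whenever the swap strictly reduces the total cost.
    best = assignment_cost(dist, selected)
    for _ in range(max_iters):
        improved = False
        for i in range(k):
            for c in range(n_concepts):
                if c in selected:
                    continue
                trial = selected[:i] + [c] + selected[i + 1:]
                cost = assignment_cost(dist, trial)
                if cost < best - 1e-12:
                    selected, best, improved = trial, cost, True
        if not improved:
            break

    # Grouping: assign each image to its nearest selected concept.
    labels = dist[:, selected].argmin(axis=1)
    return selected, labels
```

This naive version re-evaluates the full cost for every candidate swap, so it is only practical for small concept-banks. The returned `labels` array corresponds to the grouping application mentioned above, where each image is assigned to one of the K principal concepts.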
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Clustering | CIFAR-10 | NMI | 0.859 | 243 |
| Image Clustering | STL-10 | ACC | 97.9 | 229 |
| Clustering | CIFAR100 20 | ACC | 0.484 | 93 |
| Grouping | Imagenet Dogs | ACC | 69.1 | 59 |
| Grouping | Stanford Activity | ACC | 64.9 | 4 |
| Grouping | All-Age-Faces | ACC | 55.4 | 4 |
| Grouping | PPMI+ | ACC | 49 | 4 |
| Concept Retrieval | CIFAR-10 | Path Similarity | 6.07 | 2 |
| Concept Retrieval | CIFAR-20 | Path Similarity | 4.36 | 2 |
| Concept Retrieval | STL-10 | Path Similarity | 5.88 | 2 |