CAILA: Concept-Aware Intra-Layer Adapters for Compositional Zero-Shot Learning
About
In this paper, we study the problem of Compositional Zero-Shot Learning (CZSL), which aims to recognize novel attribute-object combinations of pre-existing concepts. Recent work focuses on applying large-scale Vision-Language Pre-trained (VLP) models such as CLIP, which have strong generalization ability. However, these methods treat the pre-trained model as a black box and concentrate on pre- and post-CLIP operations, without mining the semantic concepts encoded between the layers inside CLIP. We propose to dive deep into the architecture and insert adapters, a parameter-efficient technique proven effective for large language models, into each CLIP encoder layer. We further equip the adapters with concept awareness so that concept-specific features for "object", "attribute", and "composition" can be extracted. We evaluate our method on four popular CZSL datasets, MIT-States, C-GQA, UT-Zappos, and VAW-CZSL, and achieve state-of-the-art performance on all of them.
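The core idea above can be illustrated with a short sketch: a standard bottleneck adapter (down-projection, nonlinearity, up-projection, residual connection) inserted per encoder layer, duplicated once per concept so that "object", "attribute", and "composition" features are extracted by separate branches. This is a minimal, hypothetical illustration of the general technique, not the paper's exact implementation; the dimensions, module names, and gating are assumptions.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project,
    plus a residual connection around the whole block."""

    def __init__(self, dim: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class ConceptAwareAdapters(nn.Module):
    """One adapter per concept; a given forward pass routes the
    encoder-layer output through the requested concept branch.
    The three concept names follow the paper; this routing scheme
    is an illustrative assumption."""

    def __init__(self, dim: int = 512, bottleneck: int = 64):
        super().__init__()
        self.adapters = nn.ModuleDict(
            {c: Adapter(dim, bottleneck)
             for c in ("object", "attribute", "composition")}
        )

    def forward(self, x: torch.Tensor, concept: str) -> torch.Tensor:
        return self.adapters[concept](x)


# Hypothetical usage on a CLIP-like encoder-layer output
# (batch of 2, sequence length 16, hidden size 512):
layer_out = torch.randn(2, 16, 512)
module = ConceptAwareAdapters()
obj_feat = module(layer_out, "object")
attr_feat = module(layer_out, "attribute")
```

Because each adapter preserves the hidden dimension, the branches can be inserted into every encoder layer without changing the surrounding architecture, and only the small bottleneck projections are trained.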
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Generalized Compositional Zero-Shot Learning | C-GQA (test) | AUC | 0.099 | 46 |
| Compositional Zero-Shot Learning | UT-Zappos Closed World | HM | 57 | 42 |
| Compositional Zero-Shot Learning | C-GQA Closed World | HM | 32.7 | 41 |
| Compositional Zero-Shot Learning | UT-Zappos Open World | HM | 49.4 | 38 |
| Compositional Zero-Shot Learning | MIT-States Open World | HM | 21.6 | 38 |
| Compositional Zero-Shot Learning | C-GQA Open World | HM | 11.5 | 35 |
| Compositional Zero-Shot Learning | VAW-CZSL (test) | HM | 34.6 | 14 |
| Compositional Zero-Shot Learning | MIT-States Closed World (test) | AUC | 23.4 | 12 |