CAILA: Concept-Aware Intra-Layer Adapters for Compositional Zero-Shot Learning
About
In this paper, we study Compositional Zero-Shot Learning (CZSL), the task of recognizing novel attribute-object combinations built from previously seen concepts. Recent work focuses on applying large-scale Vision-Language Pre-trained (VLP) models such as CLIP, which have strong generalization ability. However, these methods treat the pre-trained model as a black box and operate only before or after CLIP, so they do not mine semantic concepts from the layers inside CLIP. We propose to dive into the architecture and insert adapters, a parameter-efficient technique proven effective for large language models, into each CLIP encoder layer. We further equip the adapters with concept awareness so that concept-specific features for "object", "attribute", and "composition" can be extracted. We evaluate our method on four popular CZSL datasets, MIT-States, C-GQA, UT-Zappos, and VAW-CZSL, and it achieves state-of-the-art performance compared to existing methods on all of them.
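The idea of placing a concept-specific adapter inside an encoder layer can be illustrated with a minimal sketch. It assumes a generic bottleneck adapter (down-projection, ReLU, up-projection, residual connection) and a toy stand-in for a frozen encoder block. The abstract does not specify the actual CAILA design, so the class names, dimensions, and zero initialization below are all hypothetical choices made for illustration.

```python
import numpy as np

# Concept types named in the abstract; one adapter per concept (assumption).
CONCEPTS = ("attribute", "object", "composition")


class BottleneckAdapter:
    """Generic bottleneck adapter: x + relu(x @ W_down) @ W_up."""

    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        # Zero-initialized up-projection: the adapter starts as the identity,
        # so the frozen model's behavior is unchanged before training.
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, x):
        hidden = np.maximum(x @ self.w_down, 0.0)  # ReLU in the bottleneck
        return x + hidden @ self.w_up              # residual connection


class ConceptAwareLayer:
    """Toy encoder layer: a frozen block followed by a per-concept adapter."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for pre-trained weights that would stay frozen.
        self.w_frozen = rng.normal(0.0, 0.02, size=(dim, dim))
        self.adapters = {
            c: BottleneckAdapter(dim, bottleneck, rng) for c in CONCEPTS
        }

    def frozen_block(self, x):
        return x + np.tanh(x @ self.w_frozen)

    def __call__(self, x, concept):
        # Only the adapter selected by `concept` would be trained.
        return self.adapters[concept](self.frozen_block(x))


layer = ConceptAwareLayer(dim=8, bottleneck=2)
tokens = np.ones((4, 8))  # 4 tokens with 8-dimensional features
features = {c: layer(tokens, c) for c in CONCEPTS}
```

In this sketch only the small `w_down` and `w_up` matrices would be updated during training, which is what makes adapters parameter-efficient; each concept gets its own pair, so the same frozen layer can produce attribute-, object-, or composition-specific features.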
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Instruction Following | Arena Hard | Win Rate | 58.9 | 263 |
| Reward Modeling | RewardBench | Chat Score | 93.6 | 216 |
| Instruction Following | AlpacaEval 2 | LC (%) | 68.8 | 137 |
| Compositional Zero-Shot Learning | C-GQA open world | HM Score | 11.5 | 65 |
| Compositional Zero-Shot Learning | UT-Zappos Closed World | HM | 57 | 57 |
| Compositional Zero-Shot Learning | C-GQA Closed World | HM | 32.7 | 56 |
| Compositional Zero-Shot Learning | UT-Zappos open world | HM | 49.4 | 52 |
| Generalized Compositional Zero-Shot Learning | C-GQA (test) | AUC | 0.099 | 46 |
| Reward Modeling | RewardBench 2 | Precise IF Score | 30.9 | 41 |
| Compositional Zero-Shot Learning | MIT-States open world | HM | 21.6 | 38 |