Compositional Zero-shot Learning via Progressive Language-based Observations

About

Compositional zero-shot learning aims to recognize unseen state-object compositions by leveraging known primitives (states and objects) during training. However, effectively modeling interactions between primitives and generalizing knowledge to novel compositions remains a perennial challenge. Two key factors contribute: object-conditioned and state-conditioned variance, i.e., the appearance of states (or objects) can vary significantly when combined with different objects (or states). For instance, the state "old" can signify a vintage design for a "car" or an advanced age for a "cat". In this paper, we argue that these variances can be mitigated by predicting composition categories based on pre-observed primitives. To this end, we propose Progressive Language-based Observations (PLO), which can dynamically determine a better observation order of primitives. These observations comprise a series of concepts or language descriptions that allow the model to understand image content in a step-by-step manner. Specifically, PLO adopts pre-trained vision-language models (VLMs) to empower the model with observation capabilities. We further devise two variants: 1) PLO-VLM: a two-step method, where a pre-observing classifier dynamically determines the observation order of the two primitives. 2) PLO-LLM: a multi-step scheme, which utilizes large language models (LLMs) to craft composition-specific prompts for step-by-step observation. Extensive ablations on three challenging datasets demonstrate the superiority of PLO compared with state-of-the-art methods, affirming its abilities in compositional recognition.
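The two-step idea behind PLO-VLM can be illustrated with a toy sketch. This is not the authors' implementation: the function name `two_step_predict`, the dictionary inputs, the "pick the more confident primitive" rule, and the hard filtering of candidate compositions are all assumptions made for illustration. The abstract only states that a pre-observing classifier decides which primitive to observe first.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]  # (state, object)


def two_step_predict(
    state_probs: Dict[str, float],
    object_probs: Dict[str, float],
    comp_scores: Dict[Pair, float],
) -> Tuple[str, str, Pair]:
    """Toy two-step observation (hypothetical, for illustration only).

    Step 1 picks the primitive type whose top prediction is more confident.
    Step 2 scores only compositions consistent with that observed primitive.
    """
    best_state = max(state_probs, key=state_probs.get)
    best_object = max(object_probs, key=object_probs.get)

    # Step 1: decide the observation order (assumed rule: higher confidence first).
    if state_probs[best_state] >= object_probs[best_object]:
        kind, observed = "state", best_state
    else:
        kind, observed = "object", best_object

    # Step 2: keep only compositions that contain the observed primitive
    # (hard filtering is an assumption; a real model may re-weight instead).
    index = 0 if kind == "state" else 1
    candidates = {p: s for p, s in comp_scores.items() if p[index] == observed}
    if not candidates:  # fall back to all compositions if nothing matches
        candidates = comp_scores

    return kind, observed, max(candidates, key=candidates.get)


# Illustrative numbers only, not taken from the paper.
state_probs = {"old": 0.55, "new": 0.45}
object_probs = {"car": 0.90, "cat": 0.10}
comp_scores = {("old", "car"): 0.4, ("new", "car"): 0.3, ("old", "cat"): 0.5}
result = two_step_predict(state_probs, object_probs, comp_scores)
```

In this toy input the object is observed first because its top prediction is more confident, so the highest-scoring composition overall, ("old", "cat"), is excluded in favor of one that contains "car".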

Lin Li, Guikun Chen, Zhen Wang, Jun Xiao, Long Chen • 2023

Related benchmarks

Task	Dataset	Result
Compositional Zero-Shot Learning	C-GQA open world	HM Score: 13.9	65
Compositional Zero-Shot Learning	UT-Zappos Closed World	HM: 55.3	57
Compositional Zero-Shot Learning	C-GQA Closed World	HM: 33	56
Compositional Zero-Shot Learning	UT-Zappos open world	HM: 47.8	52
Compositional Zero-Shot Learning	MIT-States open world	HM: 21.4	38
Compositional Zero-Shot Learning	MIT-States Closed World	Harmonic Mean (HM): 0.402	32
Compositional Zero-Shot Learning	MIT-States Closed World (test)	AUC: 23.4	27
Compositional Zero-Shot Learning	MIT-States Open-world (test)	Seen Accuracy: 49.7	14
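
For readers unfamiliar with the HM column: in compositional zero-shot learning, HM usually denotes the harmonic mean of accuracy on seen and unseen compositions. A minimal sketch follows; the function name `harmonic_mean` and the input values are illustrative and not taken from the table, which may also mix percentage and fractional scales.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean of seen and unseen accuracy (both on the same scale)."""
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0  # avoid division by zero when both accuracies are zero
    return 2 * seen_acc * unseen_acc / total


# Illustrative accuracies only: a model strong on seen pairs but weak on
# unseen pairs is penalized more than an arithmetic mean would suggest.
balanced = harmonic_mean(50.0, 50.0)
skewed = harmonic_mean(60.0, 30.0)
```

The harmonic mean rewards balance: the skewed example scores 40.0 even though the arithmetic mean of 60 and 30 is 45.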
