DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning
About
Zero-shot learning (ZSL) aims to predict unseen classes whose samples never appear during training. Among the most effective and widely used forms of semantic information for zero-shot image classification are attributes, which are annotations of class-level visual characteristics. However, current methods often fail to discriminate subtle visual distinctions between images, owing not only to the shortage of fine-grained annotations but also to attribute imbalance and co-occurrence. In this paper, we present a transformer-based end-to-end ZSL method named DUET, which integrates latent semantic knowledge from pre-trained language models (PLMs) via a self-supervised multi-modal learning paradigm. Specifically, we (1) develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from images; (2) apply an attribute-level contrastive learning strategy to further enhance the model's discrimination of fine-grained visual characteristics against attribute co-occurrence and imbalance; and (3) propose a multi-task learning policy for considering multi-model objectives. We find that DUET achieves state-of-the-art performance on three standard ZSL benchmarks and a knowledge-graph-equipped ZSL benchmark. Its components are effective and its predictions are interpretable.
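To make the idea of a contrastive objective concrete, here is a minimal sketch of a generic InfoNCE-style loss for a single anchor embedding. It is not DUET's actual attribute-level loss, which the abstract does not spell out. The function name `info_nce`, the `temperature` value, and the toy 2-D embeddings are all assumptions made for illustration. The positive can be thought of as a sample sharing the anchor's attribute value, and the negatives as samples that differ in it.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (illustrative, not DUET's exact objective)."""
    a = normalize(anchor)
    # Similarity to the positive goes first, followed by similarities to the negatives.
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - logits[0]


# Toy 2-D embeddings (hypothetical values).
anchor = [1.0, 0.0]
loss_aligned = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
loss_misaligned = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(loss_aligned < loss_misaligned)
```

The loss is small when the anchor is closer to its positive than to any negative and grows when a negative is the closer match, which is the behavior a contrastive strategy relies on to separate fine-grained attributes.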
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Generalized Zero-Shot Learning | CUB | H | 67.5 | 250 |
| Generalized Zero-Shot Learning | SUN | H | 45.8 | 184 |
| Generalized Zero-Shot Learning | AWA2 | S | 84.7 | 165 |
| Zero-shot Learning | CUB | Top-1 Accuracy | 72.3 | 144 |
| Zero-shot Learning | SUN | Top-1 Accuracy | 64.4 | 114 |
| Zero-shot Learning | AWA2 | Top-1 Accuracy | 0.699 | 95 |
| Zero-shot Image Classification | AWA2 (test) | U | 63.7 | 46 |
| Zero-shot Image Classification | CUB | U | 62.9 | 34 |
| Image Classification | AWA2 v1 (test) | U | 63.7 | 19 |
| Image Classification | SUN Attribute (test) | U | 45.7 | 19 |
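
In generalized zero-shot learning, H conventionally denotes the harmonic mean of the seen-class accuracy (S) and the unseen-class accuracy (U). The sketch below shows that calculation. The helper name `harmonic_mean` is hypothetical, and pairing the AWA2 S and U values from the table assumes that both come from the same evaluation run, which the table itself does not state.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean H = 2 * S * U / (S + U) of seen and unseen accuracy."""
    total = seen_acc + unseen_acc
    if total == 0:
        # Both accuracies are zero, so H is defined as zero here.
        return 0.0
    return 2.0 * seen_acc * unseen_acc / total


# AWA2 values taken from the table above (assumed to belong to the same run).
h_awa2 = harmonic_mean(84.7, 63.7)
print(round(h_awa2, 1))  # prints 72.7
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot score a high H by doing well on seen classes alone.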