Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning
About
Zero-shot learning (ZSL) recognizes unseen classes by conducting visual-semantic interactions that transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features using a pre-trained network backbone (i.e., CNN or ViT). Lacking the guidance of semantic information, these backbones fail to learn matched visual-semantic correspondences for representing semantic-related visual features, which results in undesirable visual-semantic interactions. To tackle this issue, we propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT mainly considers two properties in the whole network: i) discovering the semantic-related visual representations explicitly, and ii) discarding the semantic-unrelated visual information. Specifically, we first introduce semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement and to discover the semantic-related visual tokens explicitly with semantic-guided token attention. Then, we fuse visual tokens with low semantic-visual correspondence to discard the semantic-unrelated visual information for visual enhancement. These two operations are integrated into various encoders to progressively learn semantic-related visual representations for accurate visual-semantic interactions in ZSL. Extensive experiments show that ZSLViT achieves significant performance gains on three popular benchmark datasets, i.e., CUB, SUN, and AWA2. Code is available at: https://github.com/shiming-chen/ZSLViT .
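The idea of keeping semantic-related tokens and fusing the rest can be illustrated with a small sketch. This is not the authors' implementation: the function names `cosine` and `fuse_low_score_tokens`, the use of cosine similarity as the correspondence score, and the unweighted mean used for fusion are all assumptions made for illustration. It also assumes that the visual tokens and the semantic vector already live in the same embedding space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuse_low_score_tokens(tokens, semantic, keep):
    """Keep the `keep` tokens most similar to `semantic`; average the rest into one token.

    Hypothetical helper: the scoring and fusion rules are illustrative assumptions.
    """
    scores = [cosine(t, semantic) for t in tokens]
    # Rank token indices from highest to lowest correspondence score.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])  # preserve the original token order
    rest_idx = order[keep:]
    kept = [tokens[i] for i in kept_idx]
    if not rest_idx:
        return kept
    dim = len(tokens[0])
    # Fuse all low-scoring tokens into a single token by an unweighted mean.
    fused = [sum(tokens[i][d] for i in rest_idx) / len(rest_idx) for d in range(dim)]
    return kept + [fused]


# Toy example: four 2-D "visual tokens" and one semantic vector.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 2.0]]
semantic = [1.0, 0.0]
out = fuse_low_score_tokens(tokens, semantic, keep=2)
print(out)
```

In the toy example, the two tokens aligned with the semantic vector are kept unchanged, and the two unrelated tokens are replaced by a single averaged token, so the sequence shrinks from four tokens to three. Applying such a step in several encoder layers would reduce the token count progressively.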
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Generalized Zero-Shot Learning | CUB | H Score | 73.6 | 250 |
| Generalized Zero-Shot Learning | SUN | H | 47.3 | 184 |
| Generalized Zero-Shot Learning | AWA2 | S Score | 84.6 | 165 |
| Zero-shot Learning | CUB | Top-1 Accuracy | 78.9 | 144 |
| Zero-shot Learning | SUN | Top-1 Accuracy | 68.3 | 114 |
| Zero-shot Learning | AWA2 | Top-1 Accuracy | 0.707 | 95 |
| Image Classification | CUB | Unseen Top-1 Acc | 69.4 | 89 |
| Zero-shot Image Classification | AWA2 (test) | Metric U | 66.1 | 46 |
| Zero-shot Image Classification | CUB | U Score | 69.4 | 34 |
| Image Classification | AWA2 GZSL | Acc (Unseen) | 66.1 | 32 |
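
In generalized zero-shot learning, the H score is conventionally the harmonic mean of the seen-class accuracy S and the unseen-class accuracy U, i.e. H = 2·S·U / (S + U). The sketch below assumes that standard definition applies to the entries above. The function name `harmonic_mean` is illustrative, and the AWA2 value it computes is derived here from the listed S and U figures, not reported in the table.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean H of seen (S) and unseen (U) accuracy (standard GZSL definition)."""
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0  # avoid division by zero when both accuracies are 0
    return 2 * seen_acc * unseen_acc / total


# AWA2 values taken from the table: S = 84.6, U = 66.1
h_awa2 = harmonic_mean(84.6, 66.1)
print(f"AWA2 H = {h_awa2:.1f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot obtain a high H by doing well on seen classes alone.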