Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning

About

Zero-shot learning (ZSL) recognizes unseen classes by performing visual-semantic interactions that transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features with a pre-trained network backbone (i.e., a CNN or ViT). Lacking the guidance of semantic information, these backbones fail to learn matched visual-semantic correspondences for representing semantic-related visual features, which results in undesirable visual-semantic interactions. To tackle this issue, we propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT considers two properties throughout the network: i) discovering the semantic-related visual representations explicitly, and ii) discarding the semantic-unrelated visual information. Specifically, we first introduce semantic-embedded token learning, which improves the visual-semantic correspondences via semantic enhancement and discovers the semantic-related visual tokens explicitly with semantic-guided token attention. Then, we fuse visual tokens with low semantic-visual correspondence to discard the semantic-unrelated visual information for visual enhancement. These two operations are integrated into various encoders to progressively learn semantic-related visual representations for accurate visual-semantic interactions in ZSL. Extensive experiments show that ZSLViT achieves significant performance gains on three popular benchmark datasets, i.e., CUB, SUN, and AWA2. Code is available at: https://github.com/shiming-chen/ZSLViT
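
To make the token-fusion idea concrete, here is a minimal sketch, not the ZSLViT implementation. It assumes that each visual token is scored by cosine similarity to a single semantic vector and that the low-scoring tokens are merged by a plain average; the abstract specifies neither choice. The names `cosine`, `fuse_low_score_tokens`, and `keep` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def fuse_low_score_tokens(tokens, semantic, keep):
    """Keep the `keep` tokens most similar to `semantic`; average the rest into one token.

    Assumption: cosine scoring and mean fusion are illustrative stand-ins,
    not the scoring or fusion rule used in the paper.
    """
    scores = [cosine(t, semantic) for t in tokens]
    # Indices sorted from highest to lowest score (ties keep original order).
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])  # preserve the original token order
    rest_idx = order[keep:]
    kept = [tokens[i] for i in kept_idx]
    if not rest_idx:
        return kept
    dim = len(tokens[0])
    fused = [sum(tokens[i][d] for i in rest_idx) / len(rest_idx) for d in range(dim)]
    return kept + [fused]

# Toy 2-D tokens: two align with the semantic vector, two do not.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.0, 3.0]]
semantic = [1.0, 0.0]
out = fuse_low_score_tokens(tokens, semantic, keep=2)
print(out)
```

In this toy run, the two tokens aligned with the semantic vector are kept unchanged and the two orthogonal tokens are collapsed into one, so the sequence shrinks from four tokens to three. Applying such a step in several encoder layers would shorten the sequence progressively, which is the behavior the abstract describes at a high level.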

Shiming Chen, Wenjin Hou, Salman Khan, Fahad Shahbaz Khan • 2024

Related benchmarks

Task	Dataset	Result
Generalized Zero-Shot Learning	CUB	H Score 73.6	307
Generalized Zero-Shot Learning	SUN	H 47.3	229
Generalized Zero-Shot Learning	AWA2	H Score 74.2	217
Zero-shot Learning	CUB	Top-1 Accuracy 78.9	183
Zero-shot Learning	AWA2	Top-1 Accuracy 0.707	133
Zero-shot Learning	SUN	Top-1 Accuracy 68.3	132
Image Classification	CUB	Harmonic Mean Top-1 Acc 73.6	106
Image Classification	AWA2 GZSL	H (Harmonic Mean) 74.2	49
Zero-shot Image Classification	AWA2 (test)	Metric U 66.1	46
Zero-shot Image Classification	CUB	U Score 69.4	34

Showing 10 of 16 rows
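
The H values in the table are harmonic means. In generalized zero-shot learning, H is conventionally computed from the top-1 accuracy on seen classes (S) and on unseen classes (U) as H = 2·S·U / (S + U). The sketch below shows that arithmetic; the input accuracies are hypothetical and are not results from the paper, and the function name `harmonic_mean` is chosen only for illustration.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen- and unseen-class accuracy (returns 0.0 if both are 0)."""
    total = seen_acc + unseen_acc
    return 2 * seen_acc * unseen_acc / total if total > 0 else 0.0

# Hypothetical accuracies in percent, for illustration only.
h = harmonic_mean(80.0, 60.0)
print(round(h, 2))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high H by doing well on seen classes alone.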
