Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer

About

Real-world recognition systems often encounter unseen labels. To identify such labels, multi-label zero-shot learning (ML-ZSL) transfers knowledge through a pre-trained textual label embedding (e.g., GloVe). However, such methods exploit only single-modal knowledge from a language model and ignore the rich semantic information inherent in image-text pairs. In contrast, recently developed open-vocabulary (OV) methods succeed in exploiting this information from image-text pairs in object detection and achieve impressive performance. Inspired by the success of OV-based methods, we propose a novel open-vocabulary framework, named multi-modal knowledge transfer (MKT), for multi-label classification. Specifically, our method exploits the multi-modal knowledge of image-text pairs based on a vision and language pre-training (VLP) model. To facilitate transferring the image-text matching ability of the VLP model, knowledge distillation is employed to guarantee the consistency of image and label embeddings, along with prompt tuning to further update the label embeddings. To enable the recognition of multiple objects, a simple but effective two-stream module is developed to capture both local and global features. Extensive experimental results show that our method significantly outperforms state-of-the-art methods on public benchmark datasets. The source code is available at https://github.com/sunanhe/MKT.
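
The two-stream scoring and the embedding-consistency idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes that a label's score is the equal-weight mean of a global cosine similarity and the mean of the top-k patch-level cosine similarities, and that distillation is an L1 distance between a student and a teacher image embedding. The function names, the `top_k` parameter, and the 0.5 weighting are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def two_stream_scores(global_feat, patch_feats, label_embs, top_k=2):
    """Score every label from a global feature and a set of local patch features.

    Assumption (illustrative only): the final score is the equal-weight mean of
    the global similarity and the mean of the top_k patch similarities.
    """
    if not patch_feats:
        raise ValueError("patch_feats must contain at least one patch feature")
    k = min(top_k, len(patch_feats))
    scores = []
    for label in label_embs:
        # Global stream: one image-level embedding against the label embedding.
        global_sim = cosine(global_feat, label)
        # Local stream: keep only the k patches that best match this label.
        patch_sims = sorted((cosine(p, label) for p in patch_feats), reverse=True)
        local_sim = sum(patch_sims[:k]) / k
        scores.append(0.5 * (global_sim + local_sim))
    return scores


def l1_distillation_loss(student_feat, teacher_feat):
    """L1 distance between a student image embedding and a frozen teacher embedding.

    Assumption: L1 is used here only as a simple stand-in for a consistency loss.
    """
    return sum(abs(s - t) for s, t in zip(student_feat, teacher_feat))


if __name__ == "__main__":
    # Toy 2-D embeddings; new (unseen) labels only require new label embeddings.
    labels = [[1.0, 0.0], [0.0, 1.0]]
    patches = [[1.0, 0.0], [0.0, 1.0]]
    print(two_stream_scores([1.0, 0.0], patches, labels, top_k=1))
    print(l1_distillation_loss([1.0, 2.0], [1.0, 4.0]))
```

In this toy setup the second label is absent from the global feature but still receives a non-zero score from the local stream, which is the motivation for combining both streams when an image contains several objects.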

Sunan He, Taian Guo, Tao Dai, Ruizhi Qiao, Bo Ren, Shu-Tao Xia • 2022

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multi-Label Classification | NUS-WIDE (test) | mAP 37.6 | 112 |
| Multi-Label Classification | NUS-WIDE 925/81 (unseen) | mAP (Mean Average Precision) 37.6 | 43 |
| Multi-Label Classification | NUS-WIDE 81 unseen labels (test) | mAP 0.603 | 17 |
| Multi-label recognition | NUS-WIDE seen & unseen | F1 Score @ 3 22 | 10 |
| Remote Sensing Image Classification | MultiScene-OV GZSL | F1@3 37.1 | 7 |
| Remote Sensing Image Classification | MLRSNet-OV GZSL | F1@3 26.4 | 7 |
| Pedestrian Attribute Recognition | PA100K OV (test) | F1@3 18.7 | 6 |
| Pedestrian Attribute Recognition | RAP-OV (test) | F1@3 19 | 6 |
| Text-Label Classification | Open Images 3756 text labels | mAP 82.52 | 4 |
| Multi-Label Classification | NUS-WIDE GZSL | mAP 0.215 | 3 |
