Cross-modal Representation Learning for Zero-shot Action Recognition

About

We present a cross-modal Transformer-based framework that jointly encodes video data and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually new pipeline in which visual representations are learned together with visual-semantic associations in an end-to-end manner. This design provides a natural mechanism for learning visual and semantic representations in a shared knowledge space, which encourages the learned visual embedding to be discriminative and more semantically consistent. For zero-shot inference, we devise a simple semantic transfer scheme that embeds semantic relatedness information between seen and unseen classes to compose unseen visual prototypes. As a result, the discriminative features in the visual structure can be preserved and exploited to alleviate the typical zero-shot issues of information loss, semantic gap, and hubness. Under a rigorous zero-shot setting with no pre-training on additional datasets, experimental results show that our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets. Code will be made available.
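
The abstract does not spell out the semantic transfer scheme, so the following is only a minimal sketch of one way such a scheme could work, not the authors' implementation. It assumes each seen class already has a visual prototype (for example, a mean video embedding) and a text embedding, and it composes an unseen-class prototype as a similarity-weighted average of the seen prototypes. The function names, the cosine-plus-softmax weighting, and the `temperature` parameter are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=0.1):
    """Numerically stable softmax; lower temperature gives sharper weights."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compose_unseen_prototype(unseen_text, seen_texts, seen_prototypes,
                             temperature=0.1):
    """Hypothetical transfer step: weight each seen visual prototype by the
    semantic relatedness of its label to the unseen label, then average."""
    sims = [cosine(unseen_text, t) for t in seen_texts]
    weights = softmax(sims, temperature)
    dim = len(seen_prototypes[0])
    return [
        sum(w * proto[d] for w, proto in zip(weights, seen_prototypes))
        for d in range(dim)
    ]


def classify(video_embedding, unseen_prototypes):
    """Assign the unseen class whose composed prototype is closest in cosine
    similarity. `unseen_prototypes` maps class name to prototype vector."""
    return max(
        unseen_prototypes,
        key=lambda name: cosine(video_embedding, unseen_prototypes[name]),
    )
```

Because the composed prototypes live in the visual embedding space, a test video is matched against them directly rather than being projected into the text space, which is one plausible reading of how the visual structure is preserved at inference time.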

Chung-Ching Lin, Kevin Lin, Linjie Li, Lijuan Wang, Zicheng Liu • 2022

Related benchmarks

Task                Dataset                            Metric    Result  Rank
Video Recognition   UCF101 v1 (test)                   Accuracy  46.7    21
Video Recognition   HMDB51 (test)                      Accuracy  34.4    19
Action Recognition  UCF101 half classes (test)         Accuracy  58.7    18
Video Recognition   HMDB51 (Evaluation Protocol 2)     Accuracy  41.1    12
Video Recognition   UCF101 (Evaluation Protocol 2)     Accuracy  0.587   12
Video Recognition   HMDB51 (Evaluation Protocol 1)     Accuracy  34.4    7
Video Recognition   HMDB-51 half classes (test)        Accuracy  41.1    6
Action Recognition  ActivityNet half classes (test)    Accuracy  32.5    5
Video Recognition   ActivityNet full classes (test)    Accuracy  26.3    3