Objects2action: Classifying and localizing actions without any video example
About
The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches, we do not demand the design and specification of attribute classifiers and class-to-attribute mappings to allow for transfer from seen classes to unseen classes. Our key contribution is objects2action, a semantic word embedding that is spanned by a skip-gram model of thousands of object categories. Action labels are assigned to an object encoding of unseen video based on a convex combination of action and object affinities. Our semantic embedding has three main characteristics to accommodate the specifics of actions. First, we propose a mechanism to exploit multiple-word descriptions of actions and objects. Second, we incorporate the automated selection of the most responsive objects per action. And finally, we demonstrate how to extend our zero-shot approach to the spatio-temporal localization of actions in video. Experiments on four action datasets demonstrate the potential of our approach.
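The central idea is that an unseen action can be scored from a video's object classifier responses, weighted by how semantically close each object label is to the action label in the skip-gram embedding space. Below is a minimal, hypothetical sketch of that scoring rule, not the authors' released code: it assumes precomputed word vectors (`word_vec`) and per-video object scores (`object_scores`), averages word vectors to handle multiple-word labels, keeps only the top-k most responsive objects per action, and combines them as a convex combination.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def embed_phrase(phrase, word_vec):
    """Average skip-gram vectors of the words in a (possibly multi-word) label."""
    vecs = [word_vec[w] for w in phrase.lower().split() if w in word_vec]
    return np.mean(vecs, axis=0)

def objects2action_scores(object_scores, object_labels, action_labels,
                          word_vec, top_k=100):
    """Score unseen actions for one video (illustrative sketch).

    object_scores : (n_objects,) array of object classifier responses for the video
    object_labels : list of object category names (hypothetical vocabulary)
    action_labels : list of unseen action class names
    word_vec      : dict mapping a word to its skip-gram vector
    top_k         : number of most responsive objects retained per action
    """
    obj_emb = np.stack([embed_phrase(o, word_vec) for o in object_labels])
    scores = {}
    for action in action_labels:
        a_emb = embed_phrase(action, word_vec)
        # Semantic affinity between the action and every object category.
        affinity = np.maximum([cosine(a_emb, o) for o in obj_emb], 0.0)
        # Automated object selection: keep only the top-k most related objects.
        if top_k < len(affinity):
            affinity[np.argsort(affinity)[:-top_k]] = 0.0
        # Convex combination: nonnegative weights that sum to one.
        weights = affinity / (affinity.sum() + 1e-12)
        scores[action] = float(weights @ object_scores)
    return scores
```

In this sketch the same scoring function could be applied to spatio-temporal tubes instead of whole videos, which is how the zero-shot localization extension described above would operate under these assumptions.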
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | UCF101 (test) | Accuracy | 38.9 | 307 |
| Action Recognition | HMDB51 (test) | Accuracy | 24.5 | 249 |
| Action Recognition | UCF101 (3 splits) | Accuracy | 30.3 | 155 |
| Action Recognition | HMDB51 | Top-1 Accuracy | 15.6 | 30 |
| Zero-Shot Video Classification | UCF | Top-1 Accuracy | 30.3 | 16 |
| Action Recognition | UCF101 | Top-1 Accuracy | 30.3 | 15 |
| Zero-Shot Video Classification | HMDB | Top-1 Accuracy | 15.6 | 11 |
| Event Retrieval | TRECVID MED 2013 (test) | mAP | 4.21 | 5 |
| Action Recognition | UCF101 (50-51 split) | Mean Accuracy | 30.3 | 5 |
| Action Recognition | HMDB51 (26/25 split) | Accuracy | 15.6 | 5 |