Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings
About
We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved by learning a shared embedding space into which either modality can be embedded interchangeably. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS tag. The outputs of the multiple PoS embeddings are then used as input to an integrated multi-modal space, where we perform action retrieval. All embeddings are trained jointly through a combination of PoS-aware and PoS-agnostic losses. Our proposal enables learning specialised embedding spaces that offer multiple views of the same embedded entities. We report the first retrieval results on fine-grained actions for the large-scale EPIC dataset, in a generalised zero-shot setting. Results show the advantage of our approach for both video-to-text and text-to-video action retrieval. We also demonstrate the benefit of disentangling the PoS for the generic task of cross-modal video retrieval on the MSR-VTT dataset.
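The pipeline described in the abstract (one embedding space per PoS tag, an integrated space fed by their outputs, and a joint PoS-aware plus PoS-agnostic objective) can be illustrated with a minimal sketch. This is not the authors' implementation. Everything below is an assumption made for illustration: the names `PoSEmbedder` and `joint_loss`, the use of plain linear projections, concatenation as the way PoS outputs feed the integrated space, and a cosine-distance triplet loss with an unweighted sum. It also reuses one negative caption for every loss term, which a real system would likely not do.

```python
import math
import random


def linear(weights, x):
    """Project vector x with a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def l2_normalize(v):
    """Scale v to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / norm for c in v]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def cosine_distance(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: the positive should be closer to the anchor than the negative."""
    return max(0.0, margin + cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative))


class PoSEmbedder:
    """Hypothetical model: one video/text projection pair per PoS tag,
    plus a pair of projections into an integrated space (assumed design)."""

    def __init__(self, pos_tags, video_dim, text_dim, pos_dim, joint_dim, seed=0):
        rng = random.Random(seed)
        self.pos_tags = list(pos_tags)
        self.video_proj = {t: random_matrix(pos_dim, video_dim, rng) for t in self.pos_tags}
        self.text_proj = {t: random_matrix(pos_dim, text_dim, rng) for t in self.pos_tags}
        concat_dim = pos_dim * len(self.pos_tags)
        self.video_joint = random_matrix(joint_dim, concat_dim, rng)
        self.text_joint = random_matrix(joint_dim, concat_dim, rng)

    def _integrate(self, per_pos, joint_weights):
        # Assumption: PoS outputs are concatenated before the integrated projection.
        concat = [c for t in self.pos_tags for c in per_pos[t]]
        return l2_normalize(linear(joint_weights, concat))

    def embed_video(self, video_feat):
        """The same video feature is projected into every PoS space."""
        per_pos = {t: l2_normalize(linear(self.video_proj[t], video_feat))
                   for t in self.pos_tags}
        return per_pos, self._integrate(per_pos, self.video_joint)

    def embed_text(self, text_feats):
        """text_feats maps each PoS tag to the feature of that tag's words."""
        per_pos = {t: l2_normalize(linear(self.text_proj[t], text_feats[t]))
                   for t in self.pos_tags}
        return per_pos, self._integrate(per_pos, self.text_joint)


def joint_loss(model, video, pos_caption, neg_caption, margin=0.2):
    """Sum of per-PoS (PoS-aware) losses and one integrated (PoS-agnostic) loss."""
    v_pos, v_joint = model.embed_video(video)
    p_pos, p_joint = model.embed_text(pos_caption)
    n_pos, n_joint = model.embed_text(neg_caption)
    pos_aware = sum(triplet_loss(v_pos[t], p_pos[t], n_pos[t], margin)
                    for t in model.pos_tags)
    pos_agnostic = triplet_loss(v_joint, p_joint, n_joint, margin)
    return pos_aware + pos_agnostic
```

For example, with `model = PoSEmbedder(["verb", "noun"], video_dim=8, text_dim=6, pos_dim=4, joint_dim=5)`, calling `model.embed_video(video)` returns a dictionary of unit-length verb and noun embeddings together with the integrated embedding, and retrieval would rank candidates by distance in that integrated space. The weights here are random and untrained; the sketch only shows how the pieces connect.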
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Action Recognition | NTU RGB+D 60 (X-sub) | Accuracy | 64.82 | 467 |
| Skeleton-based Action Recognition | NTU RGB+D 120 (X-set) | Top-1 Accuracy | 52.8 | 184 |
| Skeleton-based Action Recognition | NTU RGB+D 120 Cross-Subject | Top-1 Accuracy | 57.3 | 143 |
| Action Recognition | NTU RGB+D 120 (Cross-View) | Accuracy | 51.93 | 47 |
| Action Recognition | NTU 60 (55/5 split) | Top-1 Acc | 64.82 | 35 |
| Action Recognition | NTU-120 110/10 split | Top-1 Acc | 51.93 | 34 |
| Skeleton Action Recognition | NTU RGB+D Cross-Subject (Xsub) 120 | Accuracy | 38.1 | 29 |
| Action Recognition | NTU-60 48/12 split | Top-1 Acc | 28.75 | 27 |
| Multi-Instance Retrieval | Epic Kitchens 100 | mAP (Avg) | 44 | 19 |
| Action Recognition | NTU-120 96/24 split | Top-1 Acc | 32.44 | 18 |