Actor and Action Video Segmentation from a Sentence
About
This paper strives for pixel-level segmentation of actors and their actions in video content. Unlike existing works, which all learn to segment from a fixed vocabulary of actor and action pairs, we infer the segmentation from a natural language input sentence. This makes it possible to distinguish between fine-grained actors in the same super-category, identify actor and action instances, and segment pairs that are outside of the actor and action vocabulary. We propose a fully-convolutional model for pixel-level actor and action segmentation using an encoder-decoder architecture optimized for video. To show the potential of actor and action video segmentation from a sentence, we extend two popular actor and action datasets with more than 7,500 natural language descriptions. Experiments demonstrate the quality of the sentence-guided segmentations, the generalization ability of our model, and its advantage for traditional actor and action segmentation compared to the state-of-the-art.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video segmentation from a sentence | A2D Sentences (test) | Overall IoU | 57.4 | 122 |
| Referring Video Object Segmentation | JHMDB Sentences (test) | Overall IoU | 0.555 | 83 |
| Referring Video Object Segmentation | A2D-Sentences | oIoU | 53.6 | 57 |
| Referring Video Object Segmentation | JHMDB Sentences | Overall IoU | 54.1 | 56 |
| Referring Video Segmentation | JHMDB Sentences (test) | mAP (0.5:0.95) | 23.3 | 35 |
| Referring Video Object Segmentation | A2D Sentences v1.0 (test) | IoU Overall | 55.1 | 26 |
| Segmentation from a sentence | J-HMDB Sentences (test) | P@0.5 | 0.712 | 20 |
| Referring Video Object Segmentation | A2D-S (test) | oIoU | 55.1 | 17 |
| Referring Video Segmentation | JHMDB Sentences | Precision @ 0.5 | 69.9 | 16 |
| Text-based Video Segmentation | A2D-Sentences | mAP (0.5:0.95) | 21.5 | 11 |
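
The table does not define its metrics. The sketch below shows how overall IoU and Precision @ 0.5 could be computed, assuming the definitions commonly used for sentence-guided segmentation: overall IoU is the total intersection divided by the total union accumulated over all test samples, and Precision @ K is the fraction of samples whose individual IoU exceeds K. The function names, the mask representation (rows of 0/1 values), and the handling of empty masks are illustrative assumptions, not the official evaluation code. mAP (0.5:0.95) is not covered here. Note that the rows above report results both as fractions (0.555) and as percentages (57.4); this sketch returns fractions.

```python
from typing import Dict, Sequence, Tuple

# Assumption: a binary mask is given as rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def evaluate(
    preds: Sequence[Mask], targets: Sequence[Mask], threshold: float = 0.5
) -> Dict[str, float]:
    """Return overall IoU, mean IoU, and precision at the given IoU threshold."""
    total_inter = 0
    total_union = 0
    ious = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
        # Assumption: a sample with an empty union is scored as IoU 0.
        ious.append(inter / union if union else 0.0)

    n = len(ious)
    return {
        # Pooled over all samples, so large regions weigh more.
        "overall_iou": total_inter / total_union if total_union else 0.0,
        # Averaged per sample, so every sample weighs the same.
        "mean_iou": sum(ious) / n if n else 0.0,
        # Fraction of samples whose IoU is strictly above the threshold.
        "precision": sum(iou > threshold for iou in ious) / n if n else 0.0,
    }


if __name__ == "__main__":
    preds = [[[1, 1], [1, 1]], [[1, 0], [0, 0]]]
    targets = [[[1, 1], [1, 1]], [[0, 1], [0, 0]]]
    print(evaluate(preds, targets))
```

Because overall IoU pools pixels across samples, a perfectly segmented large region can outweigh a missed small one, which is why it can differ from the per-sample mean IoU on the same predictions.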