An end-to-end generative framework for video segmentation and recognition
About
We describe an end-to-end generative approach for the segmentation and recognition of human activities. In this approach, a visual representation based on reduced Fisher Vectors is combined with a structured temporal model for recognition. We show that the statistical properties of Fisher Vectors make them an especially suitable front-end for generative models such as Gaussian mixtures. The system is evaluated both on the recognition of complex activities and on their parsing into action units. Using a variety of video datasets ranging from human cooking activities to animal behaviors, our experiments demonstrate that the resulting architecture outperforms state-of-the-art approaches on larger datasets, i.e. when a sufficient amount of data is available for training structured generative models.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Action Segmentation | Breakfast | -- | 107 |
| Temporal action segmentation | Breakfast | Accuracy 56.3 | 96 |
| Action Segmentation | Breakfast (test) | MoF 56.3 | 31 |
| Action Segmentation | Breakfast 14 | MoF 56.3 | 26 |
| Action Segmentation | Breakfast Action dataset | MoF 56.3 | 22 |
| Action Segmentation | 50Salads mid granularity | MoF 24.7 | 19 |
| Action Alignment | Breakfast | IoD 42.6 | 18 |
| Action Alignment | Hollywood Extended | IoD 46.9 | 15 |
| Action Recognition | Breakfast (1357:335) | Accuracy 73.3 | 13 |
| Action Segmentation | Breakfast (avg) | MoF 25.9 | 9 |
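
Several of the segmentation results above are reported as MoF, which is usually read as "mean over frames": the percentage of video frames whose predicted action label matches the ground truth. The sketch below illustrates that reading only; the function name `mean_over_frames` and the example labels are hypothetical, and the exact evaluation protocol behind each table entry (for example, how background frames are handled) is not stated here.

```python
def mean_over_frames(predicted, ground_truth):
    """Frame-wise accuracy in percent (assumed reading of 'MoF').

    Both arguments are equal-length sequences with one action label per frame.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if len(ground_truth) == 0:
        raise ValueError("at least one frame is required")
    # Count frames where the predicted label equals the annotated label.
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return 100.0 * correct / len(ground_truth)


# Illustrative labels only, not taken from any dataset annotation.
predicted = ["pour", "pour", "stir", "stir"]
ground_truth = ["pour", "stir", "stir", "stir"]
score = mean_over_frames(predicted, ground_truth)
```

Here three of the four frames agree, so `score` is 75.0. Because the measure is computed per frame, long action segments weigh more heavily than short ones.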