Fast Weakly Supervised Action Segmentation Using Mutual Consistency
About
Action segmentation is the task of predicting the action for each frame of a video. As obtaining the full annotation of videos for action segmentation is expensive, weakly supervised approaches that can learn only from transcripts are appealing. In this paper, we propose a novel end-to-end approach for weakly supervised action segmentation based on a two-branch neural network. The two branches of our network predict two redundant but different representations for action segmentation, and we propose a novel mutual consistency (MuCon) loss that enforces the consistency of the two redundant representations. Using the MuCon loss together with a loss for transcript prediction, our proposed approach achieves the accuracy of state-of-the-art approaches while being 14 times faster to train and 20 times faster during inference. The MuCon loss proves beneficial even in the fully supervised setting.
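To make the idea of two redundant representations concrete, the following is a minimal sketch, not the paper's implementation. It assumes one representation is a list of per-frame labels and the other is a list of (action, length) segments; the function names `segments_to_frames` and `frame_agreement` and the example labels are hypothetical. The actual MuCon loss is a differentiable training objective, whereas this sketch only measures how well the two representations agree on a toy example.

```python
from typing import List, Tuple


def segments_to_frames(segments: List[Tuple[str, int]]) -> List[str]:
    """Expand (action, length) segments into one label per frame."""
    frames: List[str] = []
    for action, length in segments:
        frames.extend([action] * length)
    return frames


def frame_agreement(a: List[str], b: List[str]) -> float:
    """Fraction of frames on which two frame-wise labelings agree."""
    if len(a) != len(b):
        raise ValueError("both labelings must cover the same number of frames")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical outputs of the two branches for a 5-frame video.
segments = [("pour_milk", 3), ("stir", 2)]                       # segment-level view
frame_labels = ["pour_milk", "pour_milk", "stir", "stir", "stir"]  # frame-level view

expanded = segments_to_frames(segments)
agreement = frame_agreement(expanded, frame_labels)
print(expanded)
print(agreement)
```

The same frame-wise comparison, applied to a prediction and the ground-truth labeling, is how frame accuracy metrics such as MoF (mean over frames) are commonly computed.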
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Action Segmentation | Breakfast | F1@10 | 73.2 | 107 |
| Temporal action segmentation | Breakfast | Accuracy | 47.1 | 96 |
| Action Segmentation | Breakfast | MoF | 50.7 | 66 |
| Action Segmentation | Breakfast 14 | MoF | 49.7 | 26 |
| Action Alignment | Breakfast | IoD | 66.2 | 18 |
| Action Alignment | Hollywood Extended | IoD | 52.3 | 15 |
| Action Segmentation | Hollywood Extended | -- | -- | 10 |
| Weakly-supervised Action Segmentation | Hollywood Extended | IoU | 13.9 | 9 |
| Action Segmentation | Breakfast dataset (All splits) | MoF | 48.5 | 7 |
| Action Segmentation | Hollywood Extended (avg) | Mof-bg | 41.6 | 6 |