Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation
About
We propose a novel approach to the action segmentation task for long, untrimmed videos, based on solving an optimal transport problem. By encoding a temporal consistency prior into a Gromov-Wasserstein problem, we are able to decode a temporally consistent segmentation from a noisy affinity/matching cost matrix between video frames and action classes. Unlike previous approaches, our method does not require knowing the action order for a video to attain temporal consistency. Furthermore, our resulting (fused) Gromov-Wasserstein problem can be efficiently solved on GPUs using a few iterations of projected mirror descent. We demonstrate the effectiveness of our method in an unsupervised learning setting, where our method is used to generate pseudo-labels for self-training. We evaluate our segmentation approach and unsupervised learning pipeline on the Breakfast, 50-Salads, YouTube Instructions and Desktop Assembly datasets, yielding state-of-the-art results for the unsupervised video action segmentation task.
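To make the decoding step in the abstract concrete, the following is a minimal NumPy sketch of projected mirror descent on a fused Gromov-Wasserstein objective. It is an illustration under stated assumptions, not the authors' implementation. The function names (`temporal_structure`, `segment`), the scaling of the temporal structure matrix, and the defaults for `radius`, `alpha` and `eps` are hypothetical choices made for this example. Only the frame marginal is constrained and the action marginal is left completely free, which is a simplified stand-in for the unbalanced formulation named in the title.

```python
import numpy as np


def temporal_structure(n_frames, radius):
    """Cv[i, k] > 0 when frames i and k are distinct but at most `radius` apart."""
    idx = np.arange(n_frames)
    dist = np.abs(idx[:, None] - idx[None, :])
    near = (dist >= 1) & (dist <= radius)
    # Illustrative scaling (an assumption): Cv @ T then averages the
    # neighbouring frames' soft assignments.
    return near * (n_frames / (2.0 * radius))


def segment(cost, radius=3, alpha=0.5, eps=0.1, n_iters=10):
    """Decode frame labels from a (frames x actions) matching cost matrix."""
    n_frames, n_actions = cost.shape
    Cv = temporal_structure(n_frames, radius)
    Ca = 1.0 - np.eye(n_actions)            # 1 when two actions differ
    p = np.full(n_frames, 1.0 / n_frames)   # every frame carries equal mass
    T = np.outer(p, np.full(n_actions, 1.0 / n_actions))  # uniform start

    for _ in range(n_iters):
        # Gradient of alpha * <Cv T Ca, T> + (1 - alpha) * <cost, T>;
        # the factor 2 comes from Cv and Ca being symmetric.
        grad = 2.0 * alpha * (Cv @ T @ Ca) + (1.0 - alpha) * cost
        # Mirror-descent step (step size 1/eps) on the entropy-regularised
        # objective, which reduces to T = exp(-grad / eps) up to row scaling.
        logits = -grad / eps
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        T = np.exp(logits)
        # KL projection onto {T : T @ 1 = p}, i.e. rescale each row.
        T *= (p / T.sum(axis=1))[:, None]
    return T.argmax(axis=1), T


# Toy example: three actions of ten frames each, with three corrupted frames
# whose matching cost favours the wrong action.
gt = np.repeat([0, 1, 2], 10)
cost = 1.0 - np.eye(3)[gt]
for frame, wrong in [(4, 1), (15, 2), (26, 0)]:
    cost[frame] = 1.0 - np.eye(3)[wrong]

naive = cost.argmin(axis=1)   # per-frame decision, no temporal prior
labels, T = segment(cost)     # decision with the temporal consistency prior
print("naive errors:  ", int((naive != gt).sum()))
print("decoded errors:", int((labels != gt).sum()))
```

In this toy setting the per-frame argmin mislabels the three corrupted frames, while the decoded labels follow the three contiguous segments. No action order is supplied: the Gromov-Wasserstein term only penalises nearby frames that are assigned to different actions.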
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Segmentation | Breakfast | MoF | 63.3 | 66 |
| Action Segmentation | 50 Salads Mid | -- | -- | 17 |
| Action Segmentation | YouTube Instructions | F1 | 63.3 | 16 |
| Action Segmentation | 50 Salads (eval) | MoF | 64.5 | 13 |
| Temporal action segmentation | YouTube Instructional YTI (test) | F1 Score | 35.1 | 11 |
| Unsupervised Temporal Action Segmentation | Breakfast | MOF | 56.1 | 10 |
| Action Segmentation | Desktop Assembly | MoF | 73.4 | 7 |
| Temporal action segmentation | IKEA ASM (test) | MOF | 34 | 5 |