Hierarchical Vector Quantization for Unsupervised Action Segmentation

About

In this work, we address unsupervised temporal action segmentation, which segments a set of long, untrimmed videos into semantically meaningful segments that are consistent across videos. While recent approaches combine representation learning and clustering in a single step for this task, they do not cope with large variations within temporal segments of the same class. To address this limitation, we propose a novel method, termed Hierarchical Vector Quantization (HVQ), that consists of two consecutive vector quantization modules. This results in a hierarchical clustering where the additional subclusters cover the variations within a cluster. We demonstrate that our approach captures the distribution of segment lengths much better than the state of the art. To this end, we introduce a new metric based on the Jensen-Shannon Distance (JSD) for unsupervised temporal action segmentation. We evaluate our approach on three public datasets, namely Breakfast, YouTube Instructional, and IKEA ASM. Our approach outperforms the state of the art in terms of F1 score, recall, and JSD.
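The two ideas in the abstract, a two-level quantization and a JSD score over segment lengths, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the paper learns its codebooks together with the representation, whereas the sketch assumes fixed, given codebooks and plain nearest-neighbour assignment. The function names, the histogram binning, and the base-2 logarithm are all assumptions made for illustration, and the paper's exact metric definition may differ.

```python
import numpy as np

def quantize(features, codebook):
    """Assign each row of `features` to its nearest codebook row (Euclidean)."""
    # Pairwise squared distances, shape (num_frames, num_codes).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

def hierarchical_quantize(features, sub_codebook, top_codebook):
    """Two consecutive quantization steps: frames -> subclusters -> clusters.

    Assumption: both codebooks are given; the paper learns them instead.
    """
    sub_idx, sub_vecs = quantize(features, sub_codebook)
    top_idx, _ = quantize(sub_vecs, top_codebook)
    return sub_idx, top_idx

def segment_lengths(labels):
    """Lengths of maximal runs of identical consecutive frame labels."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return np.array([], dtype=int)
    change = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    bounds = np.concatenate(([0], change, [labels.size]))
    return np.diff(bounds)

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base-2 logs, range [0, 1])."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(np.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0)))

def length_jsd(pred_labels, gt_labels, bins):
    """JSD between histograms of predicted and ground-truth segment lengths.

    Assumption: `bins` is chosen by the caller and each histogram is non-empty.
    """
    hp, _ = np.histogram(segment_lengths(pred_labels), bins=bins)
    hg, _ = np.histogram(segment_lengths(gt_labels), bins=bins)
    return jensen_shannon_distance(hp, hg)
```

A lower `length_jsd` value means the predicted segment lengths are distributed more like the ground-truth ones. A prediction that over-segments a video into many short runs will score poorly even if its frame-wise accuracy is reasonable.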

Federico Spurio, Emad Bahrami, Gianpiero Francesca, Juergen Gall • 2024

Related benchmarks

Task	Dataset	Result
Action Segmentation	Breakfast	MoF 54.4	78
Action Segmentation	YouTube Instructions	F1 35.1	28
Temporal Segmentation	Keck	Accuracy 73.5	18
Temporal Segmentation	Weizmann	ACC 73	18
Unsupervised Temporal Action Segmentation	Breakfast	MoF 54.4	16
Temporal action segmentation	YouTube Instructional YTI (test)	F1 Score 35.1	11
Temporal action segmentation	Breakfast (80:20)	MoF 44.2	5
Temporal action segmentation	IKEA ASM (test)	MoF 51.2	5
Video Action Segmentation	YouTube Instructions (80:20 split)	MoF 50.7	5
