Diffusion Action Segmentation

About

Temporal action segmentation is crucial for understanding long-form videos. Previous works on this task commonly adopt an iterative refinement paradigm by using multi-stage models. We propose a novel framework via denoising diffusion models, which nonetheless shares the same inherent spirit of such iterative refinement. In this framework, action predictions are iteratively generated from random noise with input video features as conditions. To enhance the modeling of three striking characteristics of human actions, including the position prior, the boundary ambiguity, and the relational dependency, we devise a unified masking strategy for the conditioning inputs in our framework. Extensive experiments on three benchmark datasets, i.e., GTEA, 50Salads, and Breakfast, are performed and the proposed method achieves superior or comparable results to state-of-the-art methods, showing the effectiveness of a generative approach for action segmentation.

Daochang Liu, Qiyue Li, AnhDung Dinh, Tingting Jiang, Mubarak Shah, Chang Xu • 2023
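
The abstract describes action predictions being generated iteratively from random noise, conditioned on video features. The sketch below illustrates that loop in a minimal form and is not the authors' implementation. It assumes a deterministic DDIM-style update, a cosine noise schedule, and a denoiser that predicts the clean per-frame class scores. The names `segment_video`, `toy_denoiser`, and `cosine_alpha_bar` are hypothetical, and `toy_denoiser` is a stand-in for a trained network. The masking strategy for the conditioning inputs is not modeled here.

```python
import numpy as np

def cosine_alpha_bar(num_steps):
    # Cumulative signal level per step: alpha_bar[0] = 1 (clean), alpha_bar[-1] ~ 0 (noise).
    s = 0.008
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def toy_denoiser(x_t, features, t):
    # Hypothetical stand-in for a trained network. It treats the conditioning
    # features as per-frame class scores, nudges them with the current noisy
    # estimate, and returns a softmax over classes for each frame.
    scores = features + 0.1 * x_t
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def segment_video(features, denoiser, num_classes, num_steps=8, seed=0):
    # features: array of shape (num_frames, feature_dim), the conditioning input.
    rng = np.random.default_rng(seed)
    alpha_bar = cosine_alpha_bar(num_steps)
    # Start from pure Gaussian noise over the (frames x classes) action sequence.
    x = rng.standard_normal((features.shape[0], num_classes))
    for t in range(num_steps, 0, -1):
        # Predict the clean action sequence from the current noisy one.
        x0 = denoiser(x, features, t)
        # Recover the noise implied by that prediction, then step to level t-1
        # (deterministic DDIM-style update, an assumption of this sketch).
        eps = (x - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])
        x = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps
    # After the last step x equals the final clean prediction; take per-frame labels.
    return x.argmax(axis=1)

if __name__ == "__main__":
    labels = np.array([0, 0, 1, 1, 2, 2])
    feats = 5.0 * np.eye(3)[labels]  # toy features that strongly indicate each class
    print(segment_video(feats, toy_denoiser, num_classes=3))
```

Each pass through the loop refines the whole action sequence at once, which mirrors the iterative refinement that multi-stage models perform, except that the refinement steps share one denoiser.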

Related benchmarks

Task	Dataset	Result
Action Segmentation	Breakfast	Acc 76.4	127
Temporal action segmentation	Breakfast	Accuracy 75.1	119
Temporal action segmentation	50Salads	Accuracy 88.9	117
Action Segmentation	50Salads	Edit Distance 85	114
Temporal action segmentation	GTEA	F1 Score @ 10% Threshold 92.5	105
Activity Recognition	HHAR (test)	Mean F1 Score 0.5676	46
Time-series classification	fNIRS (test)	F1 Score 0.7115	36
Sleep stage scoring	Sleep (test)	F1 Score 50.63	36
Action Segmentation	GTEA	F1@10 92.5	23
Temporal action segmentation	50 Salads 65	F1@10 90.1	22

Showing 10 of 17 rows
