Temporal Convolutional Networks: A Unified Approach to Action Segmentation
About
The dominant paradigm for video-based action segmentation is composed of two steps: first, for each frame, compute low-level features using Dense Trajectories or a Convolutional Neural Network that encode spatiotemporal information locally, and second, input these features into a classifier that captures high-level temporal relationships, such as a Recurrent Neural Network (RNN). While often effective, this decoupling requires specifying two separate models, each with its own complexities, and prevents capturing more nuanced long-range spatiotemporal relationships. We propose a unified approach, as demonstrated by our Temporal Convolutional Network (TCN), that hierarchically captures relationships at low-, intermediate-, and high-level time-scales. Our model achieves superior or competitive performance using video or sensor data on three public action segmentation datasets and can be trained in a fraction of the time it takes to train an RNN.
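The hierarchy of time-scales described above can be sketched as a small encoder-decoder over per-frame features: temporal convolutions capture short-range patterns, pooling widens the effective receptive field to intermediate and long ranges, and upsampling restores per-frame class scores. The sketch below is a minimal numpy illustration of this idea, not the paper's implementation; all dimensions, parameter names, and the single pool/upsample stage are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_conv(x, w, b):
    # "Same"-padded 1D convolution over the time axis, followed by ReLU.
    # x: (T, C_in) frame features; w: (k, C_in, C_out); b: (C_out,)
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                    for t in range(x.shape[0])])
    return np.maximum(out + b, 0.0)

def max_pool(x):
    # Halve the temporal resolution: max over non-overlapping frame pairs.
    T = x.shape[0] - x.shape[0] % 2
    return x[:T].reshape(T // 2, 2, x.shape[1]).max(axis=1)

def upsample(x):
    # Restore temporal resolution by repeating each frame twice.
    return np.repeat(x, 2, axis=0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def tcn_forward(x, p):
    # Encoder-decoder over time: each conv+pool stage sees a coarser
    # time-scale; the decoder maps back to one score vector per frame.
    h = temporal_conv(x, p["w1"], p["b1"])   # low-level time-scale
    h = max_pool(h)                          # coarser temporal resolution
    h = temporal_conv(h, p["w2"], p["b2"])   # intermediate time-scale
    h = upsample(h)
    h = temporal_conv(h, p["w3"], p["b3"])   # back at frame rate
    return softmax(h @ p["w_out"] + p["b_out"])  # (T, n_classes)

# Toy dimensions (hypothetical): 8 frames, 4 input features, 3 action classes.
T, C, H, K, n_classes = 8, 4, 6, 3, 3
params = {
    "w1": rng.normal(size=(K, C, H)), "b1": np.zeros(H),
    "w2": rng.normal(size=(K, H, H)), "b2": np.zeros(H),
    "w3": rng.normal(size=(K, H, H)), "b3": np.zeros(H),
    "w_out": rng.normal(size=(H, n_classes)), "b_out": np.zeros(n_classes),
}
probs = tcn_forward(rng.normal(size=(T, C)), params)
segmentation = probs.argmax(axis=1)  # one action label per frame
```

Because every stage is a convolution or pooling op over the whole sequence, the model is trained with standard backpropagation rather than backpropagation through time, which is one reason training is much faster than for an RNN.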
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Human Activity Recognition | REALDISP | F1 | 95.11 | 94 |
| Action Triplet Recognition | CholecT50 (test) | Top-5 Accuracy | 54.5 | 30 |
| Activity Recognition | PAMAP2 | Accuracy | 95.27 | 22 |
| Action Segmentation | JIGSAWS | Accuracy | 81.4 | 19 |
| SST forecasting | OISST | RMSE | 0.682 | 18 |
| Action Recognition | JIGSAWS Suturing (LOSO) | Per-frame Accuracy | 79.6 | 18 |
| Action Segmentation | 50 Salads (eval setup) | Edit Distance | 61.1 | 9 |
| Surgical Gesture Segmentation | JIGSAWS Kinematic suturing task | Accuracy | 79.6 | 9 |
| Action Segmentation | GTEA | Accuracy | 66.1 | 7 |
| Surgical Gesture Segmentation | JIGSAWS Video suturing task | Accuracy | 81.4 | 7 |