Temporal Convolutional Networks: A Unified Approach to Action Segmentation
About
The dominant paradigm for video-based action segmentation is composed of two steps: first, for each frame, compute low-level features using Dense Trajectories or a Convolutional Neural Network that encode spatiotemporal information locally, and second, input these features into a classifier that captures high-level temporal relationships, such as a Recurrent Neural Network (RNN). While often effective, this decoupling requires specifying two separate models, each with its own complexities, and prevents capturing more nuanced long-range spatiotemporal relationships. We propose a unified approach, as demonstrated by our Temporal Convolutional Network (TCN), that hierarchically captures relationships at low-, intermediate-, and high-level time-scales. Our model achieves superior or competitive performance using video or sensor data on three public action segmentation datasets and can be trained in a fraction of the time it takes to train an RNN.
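The hierarchy of time-scales described above can be sketched as a small encoder-decoder over per-frame features: temporal convolutions capture short-range patterns, pooling widens the effective receptive field to intermediate and long ranges, and upsampling restores per-frame class scores. The sketch below is a minimal numpy illustration of this idea, not the paper's implementation; all dimensions, parameter names, and the single pool/upsample stage are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_conv(x, w, b):
    # "Same"-padded 1D convolution over the time axis, followed by ReLU.
    # x: (T, C_in) frame features; w: (k, C_in, C_out); b: (C_out,)
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                    for t in range(x.shape[0])])
    return np.maximum(out + b, 0.0)

def max_pool(x):
    # Halve the temporal resolution: max over non-overlapping frame pairs.
    T = x.shape[0] - x.shape[0] % 2
    return x[:T].reshape(T // 2, 2, x.shape[1]).max(axis=1)

def upsample(x):
    # Restore temporal resolution by repeating each frame twice.
    return np.repeat(x, 2, axis=0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def tcn_forward(x, p):
    # Encoder-decoder over time: each conv+pool stage sees a coarser
    # time-scale; the decoder maps back to one score vector per frame.
    h = temporal_conv(x, p["w1"], p["b1"])   # low-level time-scale
    h = max_pool(h)                          # coarser temporal resolution
    h = temporal_conv(h, p["w2"], p["b2"])   # intermediate time-scale
    h = upsample(h)
    h = temporal_conv(h, p["w3"], p["b3"])   # back at frame rate
    return softmax(h @ p["w_out"] + p["b_out"])  # (T, n_classes)

# Toy dimensions (hypothetical): 8 frames, 4 input features, 3 action classes.
T, C, H, K, n_classes = 8, 4, 6, 3, 3
params = {
    "w1": rng.normal(size=(K, C, H)), "b1": np.zeros(H),
    "w2": rng.normal(size=(K, H, H)), "b2": np.zeros(H),
    "w3": rng.normal(size=(K, H, H)), "b3": np.zeros(H),
    "w_out": rng.normal(size=(H, n_classes)), "b_out": np.zeros(n_classes),
}
probs = tcn_forward(rng.normal(size=(T, C)), params)
segmentation = probs.argmax(axis=1)  # one action label per frame
```

Because every stage is a convolution or pooling op over the whole sequence, the model is trained with standard backpropagation rather than backpropagation through time, which is one reason training is much faster than for an RNN.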
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Human Activity Recognition | REALDISP | F1 | 95.11 | 94 |
| Action Triplet Recognition | CholecT50 (test) | Top-5 Accuracy | 54.5 | 30 |
| Activity Recognition | PAMAP2 | Accuracy | 95.27 | 22 |
| Action Segmentation | JIGSAWS | Accuracy | 81.4 | 19 |
| SST forecasting | OISST | RMSE | 0.682 | 18 |
| Action Recognition | JIGSAWS Suturing (LOSO) | Per-frame Accuracy | 79.6 | 18 |
| Action Segmentation | 50 Salads (eval setup) | Edit Distance | 61.1 | 9 |
| Surgical Gesture Segmentation | JIGSAWS Kinematic suturing task | Accuracy | 79.6 | 9 |
| Action Segmentation | GTEA | Accuracy | 66.1 | 7 |
| Surgical Gesture Segmentation | JIGSAWS Video suturing task | Accuracy | 81.4 | 7 |