Temporal Convolutional Networks for Action Segmentation and Detection
About
The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We introduce a new class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns, whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are more than an order of magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.
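The two ways of reaching long temporal context described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the kernel size of 3, the doubling dilation schedule, and the pooling width of 2 are all assumptions chosen for illustration.

```python
# Illustrative sketch only; all names and hyperparameters here are assumptions.

def dilated_conv1d(x, weights, dilation):
    """Same-length 1D temporal convolution with zero padding.

    Each output frame t combines inputs at t + (j - center) * dilation,
    so a larger dilation reaches frames that are further apart in time.
    """
    center = len(weights) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t + (j - center) * dilation
            if 0 <= idx < len(x):  # frames outside the sequence count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input frames visible to one output of a stack of dilated layers."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


def max_pool(x, width=2):
    """Encoder-style temporal pooling: keep the max of each window."""
    return [max(x[i:i + width]) for i in range(0, len(x), width)]


def upsample(x, factor=2):
    """Decoder-style upsampling: repeat each time step `factor` times."""
    return [v for v in x for _ in range(factor)]


impulse = [0, 0, 0, 1, 0, 0, 0]
spread = dilated_conv1d(impulse, [1, 1, 1], dilation=2)
rf = receptive_field(3, [1, 2, 4, 8])
restored = upsample(max_pool([1, 3, 2, 5, 4, 4]))
```

With these assumed settings, four dilated layers already cover 31 frames, while the pooling and upsampling path halves and then restores the sequence length, which is how both variants see long-range context without recurrence.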
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Action Segmentation | 50Salads | Edit Distance | 43.1 | 114 |
| Action Segmentation | Breakfast | -- | -- | 107 |
| Temporal action segmentation | 50Salads | Accuracy | 80.7 | 106 |
| Temporal action segmentation | GTEA | F1 Score @ 10% Threshold | 85.8 | 99 |
| Temporal action segmentation | Breakfast | Accuracy | 43.3 | 96 |
| Action Segmentation | GTEA | F1@10% | 72.2 | 39 |
| EV charging demand forecasting | Palo Alto (test) | MSE | 1.40e+3 | 38 |
| Temporal action segmentation | 50 Salads granularity (Eval) | MoF | 73.4 | 24 |
| Action Segmentation | Breakfast Action dataset | MoF | 43.3 | 22 |
| Action Segmentation | 50Salads mid granularity | MoF | 64.7 | 19 |
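For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute frame-wise accuracy (often reported as MoF, mean over frames) and a segmental edit score. It assumes the edit score is a Levenshtein distance over the ordered segment labels, normalized by the longer of the two label sequences; the leaderboards above may use slightly different conventions, and the function names are hypothetical.

```python
# Illustrative sketch only; normalization and naming are assumptions.

def frame_accuracy(pred, gt):
    """Percentage of frames whose predicted label matches the ground truth."""
    correct = sum(p == g for p, g in zip(pred, gt))
    return 100.0 * correct / len(gt)


def segment_labels(frames):
    """Collapse per-frame labels into the ordered list of segment labels."""
    labels = []
    for label in frames:
        if not labels or labels[-1] != label:
            labels.append(label)
    return labels


def levenshtein(a, b):
    """Edit distance between two label sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def edit_score(pred, gt):
    """Segmental edit score in [0, 100]; ignores segment durations."""
    p, g = segment_labels(pred), segment_labels(gt)
    return 100.0 * (1 - levenshtein(p, g) / max(len(p), len(g)))


gt = ["cut"] * 4 + ["mix"] * 4
pred = ["cut"] * 3 + ["mix"] + ["cut"] + ["mix"] * 3
acc = frame_accuracy(pred, gt)
score = edit_score(pred, gt)
```

In this toy example only two of eight frames are wrong, so frame accuracy is 75, but the prediction contains two spurious segments and the edit score drops to 50. This is why segmental metrics are reported alongside frame-wise ones: they penalize over-segmentation that frame accuracy barely registers.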