
Connectionist Temporal Modeling for Weakly Supervised Action Labeling

About

We propose a weakly supervised framework for action labeling in video that requires only the order of the occurring actions at training time. The key challenge is that the per-frame alignments between the input (video) and label (action) sequences are unknown during training. We address this by introducing the Extended Connectionist Temporal Classification (ECTC) framework, which efficiently evaluates all possible alignments via dynamic programming and explicitly enforces their consistency with frame-to-frame visual similarities. This protects the model from being distracted by visually inconsistent or degenerate alignments without the need for temporal supervision. We further extend our framework to the semi-supervised case, in which a few frames of each video are sparsely annotated. With less than 1% of frames labeled per video, our method outperforms existing semi-supervised approaches and achieves performance comparable to that of fully supervised approaches.
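The dynamic-programming idea above can be illustrated with a minimal sketch. It is not the authors' ECTC implementation: it assumes a simplified setting with no blank symbol, in which every action in the given order must cover at least one frame, and it leaves out the frame-to-frame visual similarity term that ECTC adds. The function name `log_sum_alignments` and its arguments are hypothetical, chosen for illustration only.

```python
import numpy as np

def log_sum_alignments(log_probs, order):
    """Log of the total probability of all monotonic frame-to-action alignments.

    log_probs: array of shape (T, K), per-frame log-probabilities over K actions.
    order:     list of action indices in the order they occur in the video.
    Simplifying assumption: each action in `order` covers at least one frame.
    """
    T = log_probs.shape[0]
    N = len(order)
    if N == 0 or N > T:
        return -np.inf  # no valid alignment exists

    # alpha[t, n]: log-probability of all partial alignments that end
    # at frame t while labeling it with the n-th action in `order`.
    alpha = np.full((T, N), -np.inf)
    alpha[0, 0] = log_probs[0, order[0]]

    for t in range(1, T):
        for n in range(N):
            stay = alpha[t - 1, n]                             # same action continues
            move = alpha[t - 1, n - 1] if n > 0 else -np.inf   # next action starts
            alpha[t, n] = np.logaddexp(stay, move) + log_probs[t, order[n]]

    # Valid alignments must finish on the last action at the last frame.
    return alpha[T - 1, N - 1]

# Usage: 3 frames, 2 actions, uniform per-frame predictions.
uniform = np.log(np.full((3, 2), 0.5))
total = log_sum_alignments(uniform, [0, 1])
```

In the usage example there are two valid alignments, (0, 0, 1) and (0, 1, 1), each with probability 0.5³, so `np.exp(total)` is 0.25. The recursion costs O(T·N) rather than enumerating every alignment, which is what makes summing over all alignments tractable during training.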

De-An Huang, Li Fei-Fei, Juan Carlos Niebles • 2016

Related benchmarks

Task | Dataset | Result | Rank
Temporal action segmentation | Breakfast | Accuracy 27.7 | 96
Action Segmentation | Breakfast (test) | MoF 27.7 | 31
Action Segmentation | Breakfast 14 | MoF 27.7 | 26
Action Segmentation | Breakfast Action dataset | MoF 27.7 | 22
Action Segmentation | 50Salads mid granularity | MoF 11.9 | 19
Generic Event Boundary Detection | TAPOS (val) | F1 Score @ 0.05: 24.4 | 18
Action Alignment | Breakfast | IoD 45 | 18
Action Alignment | Hollywood Extended | IoD 41 | 15
Action Alignment | Hollywood Extended (test) | IoD 41 | 12
Generic Event Boundary Detection | TAPOS | Recall @ 0.05: 59.6 | 9
Showing 10 of 20 rows
