Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
About
Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state, such as the steps of a recipe or a DIY fix-it task. Prior work largely treats keystep recognition in isolation from this broader structure, or else rigidly confines keysteps to align with a predefined sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, and then leveraging this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional videos, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.
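The idea of a probabilistic task graph that regularizes per-segment predictions can be illustrated with a small sketch. This is not the authors' implementation: it assumes keystep sequences are already available as integer labels, estimates transition probabilities by simple counting with smoothing, and uses standard Viterbi decoding as a stand-in for whatever path search the paper uses. The names `mine_task_graph`, `decode_with_graph`, and the `smoothing` parameter are hypothetical.

```python
import numpy as np


def mine_task_graph(sequences, num_keysteps, smoothing=1e-3):
    """Estimate keystep-to-keystep transition probabilities by counting.

    `sequences` is a list of keystep-index lists, one per video
    (an assumption; in practice these would come from a recognizer).
    """
    counts = np.full((num_keysteps, num_keysteps), smoothing)
    for seq in sequences:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev, nxt] += 1.0
    # Normalize each row so it is a distribution over the next keystep.
    return counts / counts.sum(axis=1, keepdims=True)


def decode_with_graph(scores, transitions):
    """Viterbi decoding over per-segment keystep scores (log domain).

    scores: array of shape (num_segments, num_keysteps).
    transitions: row-stochastic matrix from `mine_task_graph`.
    """
    num_segments, num_keysteps = scores.shape
    log_t = np.log(transitions)
    best = np.zeros((num_segments, num_keysteps))
    back = np.zeros((num_segments, num_keysteps), dtype=int)
    best[0] = scores[0]
    for t in range(1, num_segments):
        # cand[i, j]: best score ending in keystep i, then moving to j.
        cand = best[t - 1][:, None] + log_t
        back[t] = cand.argmax(axis=0)
        best[t] = cand.max(axis=0) + scores[t]
    # Trace the best path backwards from the final segment.
    path = [int(best[-1].argmax())]
    for t in range(num_segments - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


# Toy data: most videos follow 0 -> 1 -> 2, one skips keystep 1.
graph = mine_task_graph([[0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 2]], num_keysteps=3)

# Per-segment scores where the middle segment is ambiguous.
scores = np.array([
    [0.0, -5.0, -5.0],
    [-5.0, -1.0, -0.8],
    [-5.0, -5.0, 0.0],
])
independent = scores.argmax(axis=1).tolist()  # per-segment argmax, no graph
regularized = decode_with_graph(scores, graph)
```

In this toy case the per-segment argmax labels the middle segment as keystep 2, while the graph-regularized path prefers keystep 1 because the 0 to 1 to 2 ordering dominates the mined transitions.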
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Step Forecasting | COIN | Accuracy | 40.2 | 22 |
| Keystep recognition | COIN (test) | Accuracy | 16.9 | 18 |
| Keystep recognition | CrossTask (test) | Accuracy | 28.9 | 18 |
| Keystep recognition | CrossTask | Accuracy | 64.5 | 17 |
| Keystep recognition | COIN | Accuracy | 57.2 | 14 |
| Task recognition | COIN | Accuracy | 90.5 | 14 |
| Next forecasting | COIN (test) | Top-1 Accuracy | 40.2 | 13 |
| Keystep forecasting | CrossTask | Accuracy | 30.2 | 12 |
| Task recognition | CrossTask | Accuracy | 97.1 | 12 |
| Step Recognition | COIN (test) | Top-1 Accuracy | 57.2 | 11 |