ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos
About
Procedural planning aims to predict a sequence of actions that transforms an initial visual state into a desired goal, a fundamental ability for intelligent agents operating in complex environments. Existing approaches typically rely on large-scale models that learn procedural structures implicitly, resulting in limited sample efficiency and high computational cost. In this work, we introduce ViterbiPlanNet, a principled framework that explicitly integrates procedural knowledge into the learning process through a Differentiable Viterbi Layer (DVL). The DVL embeds a Procedural Knowledge Graph (PKG) directly into the Viterbi decoding algorithm, replacing non-differentiable operations with smooth relaxations that enable end-to-end optimization. This design allows the model to learn through graph-based decoding. Experiments on CrossTask, COIN, and NIV demonstrate that ViterbiPlanNet achieves state-of-the-art performance with an order of magnitude fewer parameters than diffusion- and LLM-based planners. Extensive ablations show that the performance gains arise from our differentiable structure-aware training rather than from post-hoc refinement, resulting in improved sample efficiency and robustness to shorter unseen horizons. We also address testing inconsistencies by establishing a unified testing protocol with consistent splits and evaluation metrics. With this new protocol, we run experiments multiple times and report results using bootstrapping to assess statistical significance.
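To make the idea of "replacing non-differentiable operations with smooth relaxations" concrete, the sketch below shows one common way to relax Viterbi decoding: the hard `max` becomes a temperature-scaled log-sum-exp and the hard `argmax` backpointer becomes a softmax. This is an illustrative NumPy sketch, not the paper's DVL. The function name `soft_viterbi`, the `temperature` parameter, the toy emission scores, and the use of a dense log-transition matrix as a stand-in for the PKG are all assumptions made for this example.

```python
import numpy as np

def smooth_max(x, axis, temperature):
    """Smooth relaxation of max: temperature * logsumexp(x / temperature)."""
    z = x / temperature
    m = np.max(z, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(z - m), axis=axis, keepdims=True))
    return temperature * np.squeeze(out, axis=axis)

def soft_argmax(x, axis, temperature):
    """Smooth relaxation of argmax: a softmax distribution over candidates."""
    z = x / temperature
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def soft_viterbi(log_emit, log_trans, temperature=1.0):
    """Relaxed Viterbi over T steps and K actions (hypothetical helper).

    log_emit:  (T, K) per-step action scores (assumed to come from a model).
    log_trans: (K, K) scores where [i, j] rates the transition i -> j
               (assumed here to encode the procedural graph).
    Returns the smoothed best-path score and a (T, K) array of soft
    per-step action distributions.
    """
    T, K = log_emit.shape
    delta = np.zeros((T, K))
    backptr = np.zeros((T, K, K))  # backptr[t, i, j]: weight of i as predecessor of j
    delta[0] = log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # (K, K), indexed [i, j]
        delta[t] = log_emit[t] + smooth_max(scores, axis=0, temperature=temperature)
        backptr[t] = soft_argmax(scores, axis=0, temperature=temperature)

    score = smooth_max(delta[T - 1], axis=0, temperature=temperature)
    # Soft backtracking: propagate the final distribution through the backpointers.
    q = np.zeros((T, K))
    q[T - 1] = soft_argmax(delta[T - 1], axis=0, temperature=temperature)
    for t in range(T - 1, 0, -1):
        q[t - 1] = backptr[t] @ q[t]
    return score, q

# Toy example: 3 steps, 3 actions, transitions favoring 0 -> 1 -> 2 -> 0.
log_emit = np.log(np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.6, 0.3],
                            [0.2, 0.2, 0.6]]))
log_trans = np.log(np.array([[0.1, 0.8, 0.1],
                             [0.1, 0.1, 0.8],
                             [0.8, 0.1, 0.1]]))
score, q = soft_viterbi(log_emit, log_trans, temperature=1.0)
```

Every operation in the recursion is smooth, so the returned score and distributions admit gradients with respect to both `log_emit` and `log_trans`; an autodiff framework would be needed to actually train through it. As `temperature` approaches zero, the relaxation approaches standard hard Viterbi decoding.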
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Procedure Planning | CrossTask | Success Rate (SR) | 39.75 | 43 |
| Procedure Planning | COIN T=3 (test) | SR | 0.3399 | 40 |
| Procedure Planning | NIV T=3 (test) | SR | 32.37 | 30 |
| Procedure Planning | CrossTask T=3 (test) | SR | 38.45 | 27 |
| Procedure Planning | NIV | Success Rate (SR) | 34.44 | 26 |
| Visual Planning | CrossTask | Success Rate (SR) | 38.45 | 22 |
| Visual Planning | COIN | Success Rate (SR) | 33.99 | 22 |
| Procedure Planning | EgoPER | Success Rate (SR) | 51.84 | 8 |
| Procedure Planning | CrossTask T=3 | Success Rate (SR) | 39.75 | 7 |
| Procedure Planning | CrossTask T=4 | SR | 0.2419 | 7 |