SPIRAL: Self-Evolving Action-Conditioned Video Generation via Reflective Planning Agents
About
Long-horizon action-conditioned video generation aims to synthesize temporally coherent videos that follow complex action instructions over extended horizons. This requires procedural ordering, persistent action execution, and scene consistency, which go beyond the short-term fidelity targeted by conventional text-image-to-video (TI2V) generation. Existing single-shot video generation models typically operate in an open-loop manner, leading to incomplete action execution, hallucinated motions, and temporal drift. To address this, we propose SPIRAL, a closed-loop framework that performs sequential planning and iterative reflection for action-conditioned long-horizon video generation. Specifically, SPIRAL instantiates a think-act-reflect process: a PlanAgent decomposes high-level goals into sub-actions; each sub-action, together with a memory context, conditions a VideoGenerator to synthesize the corresponding segment; and a CriticAgent evaluates intermediate video segments to provide corrective feedback for iterative refinement. This closed-loop design further supports self-evolution: PlanAgent-proposed actions and CriticAgent-derived rewards are used for GRPO-based post-training to enhance the video generator's long-horizon consistency. Moreover, we introduce ActVideoGen-Dataset for task-specific training, and establish ActVideoGen-Bench as a dedicated evaluation suite for measuring action quality and temporal coherence. Experiments across multiple TI2V backbones, combined with the self-evolving strategy, show consistent gains on ActVideoGen-Bench and VBench, demonstrating the effectiveness of SPIRAL.
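The think-act-reflect process described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`run_closed_loop`, `group_relative_advantages`), the callable interfaces for the planner, generator, and critic, the acceptance `threshold`, and the retry budget `max_retries` are all assumptions made for illustration. The second function shows the group-normalized advantage commonly used in GRPO (reward minus group mean, divided by group standard deviation); how SPIRAL actually shapes its rewards is not specified in the abstract.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List, Optional, Tuple


@dataclass
class Segment:
    action: str   # sub-action proposed by the planner
    frames: str   # placeholder for real video frames
    score: float  # critic score for this segment


def run_closed_loop(
    goal: str,
    plan: Callable[[str], List[str]],                   # hypothetical PlanAgent
    generate: Callable[[str, List[str], str], str],     # hypothetical VideoGenerator
    critique: Callable[[str, str], Tuple[float, str]],  # hypothetical CriticAgent
    threshold: float = 0.8,   # assumed acceptance score
    max_retries: int = 2,     # assumed refinement budget per segment
) -> List[Segment]:
    """Plan sub-actions, generate each segment, and refine it with critic feedback."""
    memory: List[str] = []        # previously accepted segments (memory context)
    segments: List[Segment] = []
    for action in plan(goal):                         # think
        feedback = ""
        best: Optional[Segment] = None
        for _ in range(max_retries + 1):
            frames = generate(action, memory, feedback)   # act
            score, feedback = critique(action, frames)    # reflect
            if best is None or score > best.score:
                best = Segment(action, frames, score)
            if score >= threshold:
                break
        assert best is not None   # the inner loop always runs at least once
        memory.append(best.frames)
        segments.append(best)
    return segments


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one sampled group, as in GRPO-style training."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

In this sketch the best-scoring attempt is kept even when no attempt reaches the threshold, so the loop always produces a segment for every planned sub-action; a real system might instead replan or stop.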
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Long-horizon procedural planning | EgoPlan-Bench All | Success Rate | 58.72 | 13 |
| Long-horizon procedural planning | EgoPlan-Bench In-Domain | Success Rate | 62.46 | 9 |
| Long-horizon procedural planning | EgoPlan-Bench Out-of-Domain | Success Rate | 54.3 | 9 |
| Video Reward Assessment | VideoGen-Reward Bench | VQ Accuracy (w/ Ties) | 49.79 | 9 |
| Image-to-Video | ActWM-Bench | Aesthetic Quality | 55 | 8 |
| Text-to-Video | ActWM-Bench | Aesthetic Quality | 0.568 | 8 |