SPIRAL: Self-Evolving Action-Conditioned Video Generation via Reflective Planning Agents

About

Long-horizon action-conditioned video generation aims to synthesize temporally coherent videos that follow complex action instructions over extended horizons, requiring procedural ordering, persistent action execution, and scene consistency beyond the short-term fidelity of conventional text-and-image-to-video (TI2V) generation. Existing single-shot video generation models typically operate in an open-loop manner, leading to incomplete action execution, hallucinated motions, and temporal drift. To address this, we propose SPIRAL, a closed-loop framework that performs sequential planning and iterative reflection for action-conditioned long-horizon video generation. Specifically, SPIRAL instantiates a think-act-reflect process: a PlanAgent decomposes high-level goals into sub-actions, which condition a VideoGenerator to synthesize each segment alongside a memory context, while a CriticAgent evaluates intermediate video segments to provide corrective feedback for iterative refinement. This closed-loop design further supports self-evolution by utilizing PlanAgent-proposed actions and CriticAgent-derived rewards for GRPO-based (Group Relative Policy Optimization) post-training to enhance the video generator's long-horizon consistency. Moreover, we introduce ActVideoGen-Dataset for task-specific training, and establish ActVideoGen-Bench as a dedicated evaluation suite for measuring action quality and temporal coherence. Experiments across multiple TI2V backbones, with and without the self-evolving strategy, show consistent gains on ActVideoGen-Bench and VBench, demonstrating the effectiveness of SPIRAL.
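
The think-act-reflect loop described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the callables `plan`, `generate`, and `critique` are hypothetical stand-ins for the PlanAgent, VideoGenerator, and CriticAgent, and the `Memory` class, the acceptance `threshold`, and the retry limit are assumptions made for illustration. The helper `group_relative_advantages` shows the general GRPO idea of normalizing rewards within a group of samples; how SPIRAL actually forms groups and applies the rewards is not specified in the abstract.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, List, Tuple


@dataclass
class Memory:
    """Hypothetical memory context: the segments accepted so far."""
    segments: List[str] = field(default_factory=list)


def run_closed_loop(
    goal: str,
    plan: Callable[[str], List[str]],
    generate: Callable[[str, Memory, str], str],
    critique: Callable[[str, str], Tuple[float, str]],
    threshold: float = 0.8,   # assumed acceptance score, not from the paper
    max_retries: int = 2,     # assumed refinement budget, not from the paper
) -> Tuple[List[str], List[Tuple[str, float]]]:
    """Plan sub-actions, generate each segment, and refine it using critic feedback."""
    memory = Memory()
    rewards: List[Tuple[str, float]] = []
    for sub_action in plan(goal):                      # think
        feedback = ""
        best_segment, best_score = None, float("-inf")
        for _ in range(max_retries + 1):
            segment = generate(sub_action, memory, feedback)   # act
            score, feedback = critique(sub_action, segment)    # reflect
            rewards.append((sub_action, score))        # kept for post-training
            if score > best_score:
                best_segment, best_score = segment, score
            if score >= threshold:
                break
        # Keep the best-scoring attempt so later segments condition on it.
        memory.segments.append(best_segment)
    return memory.segments, rewards


def group_relative_advantages(scores: List[float]) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0] * len(scores)   # identical rewards carry no learning signal
    return [(s - mu) / sigma for s in scores]
```

In this sketch, every critic score is logged alongside the sub-action that produced it, so the same loop that refines a video at inference time also yields (action, reward) pairs that a GRPO-style post-training step could consume.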

Yu Yang, Yue Liao, Jianbiao Mei, Baisen Wang, Xuemeng Yang, Licheng Wen, Jiangning Zhang, Xiangtai Li, Liang Lv, Hanlin Chen, Botian Shi, Yong Liu, Shuicheng Yan, Gim Hee Lee • 2026

Related benchmarks

Task	Dataset	Result
Long-horizon procedural planning	EgoPlan-Bench All	Success Rate: 58.72	13
Long-horizon procedural planning	EgoPlan-Bench In-Domain	Success Rate: 62.46	9
Long-horizon procedural planning	EgoPlan-Bench Out-of-Domain	Success Rate: 54.3	9
Video Reward Assessment	VideoGen-Reward Bench	VQ Accuracy (w/ Ties): 49.79	9
Image-to-Video	ActWM-Bench	Aesthetic Quality: 55	8
Text-to-Video	ActWM-Bench	Aesthetic Quality: 0.568	8
