
SPIRAL: Self-Evolving Action-Conditioned Video Generation via Reflective Planning Agents

About

Long-horizon action-conditioned video generation aims to synthesize temporally coherent videos that follow complex action instructions over extended horizons, requiring procedural ordering, persistent action execution, and scene consistency beyond conventional TI2V's short-term fidelity. Existing single-shot video generation models typically operate in an open-loop manner, leading to incomplete action execution, hallucinated motions, and temporal drift. To address this, we propose SPIRAL, a closed-loop framework that performs sequential planning and iterative reflection for action-conditioned long-horizon video generation. Specifically, SPIRAL instantiates a think-act-reflect process: a PlanAgent decomposes high-level goals into sub-actions, which condition a VideoGenerator to synthesize each segment alongside a memory context, while a CriticAgent evaluates intermediate video segments to provide corrective feedback for iterative refinement. This closed-loop design further supports self-evolution by utilizing PlanAgent-proposed actions and CriticAgent-derived rewards for GRPO-based post-training to enhance the video generator's long-horizon consistency. Moreover, we introduce ActVideoGen-Dataset for task-specific training, and establish ActVideoGen-Bench as a dedicated evaluation suite for measuring action quality and temporal coherence. Experiments across multiple TI2V backbones alongside the self-evolving strategy show consistent gains on ActVideoGen-Bench and VBench, demonstrating the effectiveness of SPIRAL.
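The think-act-reflect loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the retry limit, the acceptance threshold, and the string-based "segments" are all hypothetical stand-ins chosen for illustration. The second function shows the standard group-relative advantage normalization used in GRPO (reward minus group mean, divided by group standard deviation). How SPIRAL actually maps CriticAgent scores to rewards is not specified here, so that connection is an assumption.

```python
from statistics import mean, pstdev
from typing import Callable, List, Tuple

# Hypothetical interfaces; the real agents would wrap a planner model,
# a TI2V backbone, and a video critic.
Plan = Callable[[str], List[str]]                    # goal -> ordered sub-actions
Generate = Callable[[str, List[str], str], str]      # (sub_action, memory, feedback) -> segment
Critique = Callable[[str, str], Tuple[float, str]]   # (sub_action, segment) -> (score, feedback)


def think_act_reflect(goal: str, plan: Plan, generate: Generate, critique: Critique,
                      threshold: float = 0.8, max_retries: int = 2):
    """Closed-loop generation: plan, generate each segment, refine on critic feedback."""
    memory: List[str] = []      # accepted segments, passed back as context
    rewards: List[float] = []   # best critic score per sub-action
    for sub_action in plan(goal):                      # think
        feedback = ""
        best_score, best_segment = float("-inf"), ""
        for _ in range(max_retries + 1):
            segment = generate(sub_action, memory, feedback)   # act
            score, feedback = critique(sub_action, segment)    # reflect
            if score > best_score:
                best_score, best_segment = score, segment
            if score >= threshold:                     # assumed acceptance rule
                break
        memory.append(best_segment)
        rewards.append(best_score)
    return memory, rewards


def group_relative_advantages(group_rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)   # population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in group_rewards]


# Toy stand-ins so the sketch runs end to end.
def toy_plan(goal: str) -> List[str]:
    return [step.strip() for step in goal.split(",")]


def toy_generate(sub_action: str, memory: List[str], feedback: str) -> str:
    return f"<{sub_action}{'+fixed' if feedback else ''}>"


def toy_critique(sub_action: str, segment: str) -> Tuple[float, str]:
    if segment.endswith("+fixed>"):
        return 1.0, ""
    return 0.3, "action incomplete"


segments, rewards = think_act_reflect("open drawer, pick up cup",
                                      toy_plan, toy_generate, toy_critique)
advantages = group_relative_advantages([1.0, 0.0])
print(segments, rewards, advantages)
```

In this toy run the critic rejects each first attempt, so every sub-action is regenerated once with feedback before being accepted into memory. Keeping the best-scoring segment rather than the last one is a design choice of the sketch, made so that a failed final retry cannot degrade the rollout.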

Yu Yang, Yue Liao, Jianbiao Mei, Baisen Wang, Xuemeng Yang, Licheng Wen, Jiangning Zhang, Xiangtai Li, Liang Lv, Hanlin Chen, Botian Shi, Yong Liu, Shuicheng Yan, Gim Hee Lee • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long-horizon procedural planning | EgoPlan-Bench All | Success Rate | 58.72 | 13 |
| Long-horizon procedural planning | EgoPlan-Bench In-Domain | Success Rate | 62.46 | 9 |
| Long-horizon procedural planning | EgoPlan-Bench Out-of-Domain | Success Rate | 54.3 | 9 |
| Video Reward Assessment | VideoGen-Reward Bench | VQ Accuracy (w/ Ties) | 49.79 | 9 |
| Image-to-Video | ActWM-Bench | Aesthetic Quality | 55 | 8 |
| Text-to-Video | ActWM-Bench | Aesthetic Quality | 0.568 | 8 |
