
SPIRAL: A Closed-Loop Framework for Self-Improving Action World Models via Reflective Planning Agents

About

We introduce SPIRAL, a closed-loop framework for self-improving planning and iterative reflective action world modeling that enables controllable long-horizon video generation conditioned on high-level semantic actions. Existing one-shot video generation models operate open-loop, which often leads to incomplete action execution, weak semantic grounding, and temporal drift. SPIRAL instead formulates action world modeling (ActWM) as a closed-loop think-act-reflect process, in which generation proceeds step by step under explicit planning and feedback. A PlanAgent decomposes abstract actions into object-centric sub-actions, while a CriticAgent evaluates intermediate results and guides iterative refinement using long-horizon memory. This closed-loop design naturally supports RL-based evolving optimization, improving semantic alignment and temporal consistency over extended horizons. We further introduce the ActWM-Dataset and ActWM-Bench for training and evaluation. Experiments across multiple text-image-to-video (TI2V) backbones show consistent gains on ActWM-Bench and on mainstream video generation benchmarks, validating the effectiveness of SPIRAL.
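The think-act-reflect loop described above can be pictured as a simple control loop. The sketch below is a minimal illustration built on several assumptions and is not SPIRAL's implementation. `plan`, `generate`, and `critique` are hypothetical callables that stand in for the PlanAgent, the video backbone, and the CriticAgent. Clips are plain strings instead of video tensors. The retry budget `max_retries` and the `Memory` structure are also hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Memory:
    """Hypothetical long-horizon memory: accepted clips plus all critic feedback."""
    clips: List[str] = field(default_factory=list)
    feedback: List[str] = field(default_factory=list)


def closed_loop_generate(
    action: str,
    plan: Callable[[str], List[str]],                # think: decompose into sub-actions
    generate: Callable[[str, Memory, str], str],     # act: produce a clip for one sub-action
    critique: Callable[[str, str], Tuple[bool, str]],  # reflect: (accepted?, feedback hint)
    max_retries: int = 3,
) -> Memory:
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    memory = Memory()
    for sub_action in plan(action):
        hint = ""
        for _ in range(max_retries):
            # Generation is conditioned on the sub-action, past clips, and the last hint.
            clip = generate(sub_action, memory, hint)
            ok, hint = critique(sub_action, clip)
            memory.feedback.append(hint)
            if ok:
                break
        # Keep the last attempt even if the critic never accepted it.
        memory.clips.append(clip)
    return memory


# Toy stand-ins (hypothetical) so the loop can run end to end.
def toy_plan(action: str) -> List[str]:
    return [f"{action}: step {i}" for i in range(1, 4)]


def toy_generate(sub_action: str, memory: Memory, hint: str) -> str:
    # A first attempt ignores feedback; a retry with a hint is marked "refined".
    return f"clip[{sub_action}]" + (" refined" if hint else "")


def toy_critique(sub_action: str, clip: str) -> Tuple[bool, str]:
    # Accept only refined clips; otherwise return a corrective hint.
    if clip.endswith(" refined"):
        return True, ""
    return False, "add missing object motion"


if __name__ == "__main__":
    mem = closed_loop_generate("open drawer", toy_plan, toy_generate, toy_critique)
    for c in mem.clips:
        print(c)
```

With these toy stand-ins, each sub-action is rejected once and accepted after one refinement. The memory therefore ends up with three clips and six feedback entries. Keeping the critic's feedback in memory alongside the clips reflects the abstract's description of reflection guided by long-horizon memory. How SPIRAL actually stores and uses that memory is not specified here.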

Yu Yang, Yue Liao, Jianbiao Mei, Baisen Wang, Xuemeng Yang, Licheng Wen, Jiangning Zhang, Xiangtai Li, Hanlin Chen, Botian Shi, Yong Liu, Shuicheng Yan, Gim Hee Lee • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long-horizon procedural planning | EgoPlan-Bench (All) | Success Rate | 58.72 | 13 |
| Long-horizon procedural planning | EgoPlan-Bench (In-Domain) | Success Rate | 62.46 | 9 |
| Long-horizon procedural planning | EgoPlan-Bench (Out-of-Domain) | Success Rate | 54.3 | 9 |
| Video Reward Assessment | VideoGen-Reward Bench | VQ Accuracy (w/ Ties) | 49.79 | 9 |
| Image-to-Video | ActWM-Bench | Aesthetic Quality | 55 | 8 |
| Text-to-Video | ActWM-Bench | Aesthetic Quality | 0.568 | 8 |
