Co-Evolving Latent Action World Models

About

Adapting pretrained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two-stage approach that trains the latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co-adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamics model in the LAM with a powerful world model and train the two jointly, but doing so is non-trivial and prone to representational collapse. In this work, we propose CoLA-World, which for the first time successfully realizes this synergistic paradigm, resolving the core challenge in joint learning through a critical warm-up phase that effectively aligns the representations of the from-scratch LAM with the pretrained world model. This unlocks a co-evolution cycle: the world model acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM, while the LAM offers a more precise and adaptable control interface to the world model. Empirically, CoLA-World matches or outperforms prior two-stage methods in both video simulation quality and downstream visual planning, establishing a robust and efficient new paradigm for the field.
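
To make the idea of a warm-up followed by joint training concrete, here is a minimal toy sketch, not the authors' implementation. It assumes scalar observations, a one-parameter LAM, and a one-parameter "pretrained" world model, and it assumes that warm-up means keeping the world model frozen while the LAM trains on gradients that flow through it; the abstract does not spell out these details. All names (`train`, `a`, `w`, `warmup_steps`) are hypothetical.

```python
# Toy sketch only: every name and modelling choice here is an assumption
# made for illustration, not the CoLA-World architecture.

def train(pairs, warmup_steps, total_steps, lr=0.05):
    """Train a toy LAM and world model on (o_t, o_next) scalar pairs.

    Returns a list of (loss_before_update, a, w) tuples, one per step,
    where a and w are the parameter values after that step's update.
    """
    a = 0.0  # LAM parameter, trained from scratch: z = a * (o_next - o_t)
    w = 1.0  # "pretrained" world-model parameter: pred = o_t + w * z
    history = []
    n = len(pairs)
    for step in range(total_steps):
        loss = grad_a = grad_w = 0.0
        for o_t, o_next in pairs:
            delta = o_next - o_t
            z = a * delta            # LAM infers a latent action from the frame pair
            pred = o_t + w * z       # world model predicts the next observation
            err = pred - o_next
            loss += err ** 2
            grad_a += 2 * err * w * delta  # gradient reaches the LAM through the world model
            grad_w += 2 * err * z
        a -= lr * grad_a / n
        if step >= warmup_steps:     # assumption: world model stays frozen during warm-up
            w -= lr * grad_w / n
        history.append((loss / n, a, w))
    return history


if __name__ == "__main__":
    data = [(0.0, 1.0), (1.0, 0.0), (0.5, 1.5)]
    hist = train(data, warmup_steps=20, total_steps=100)
    print("first loss:", hist[0][0], "last loss:", hist[-1][0])
```

In this toy setting the world-model parameter is untouched for the first `warmup_steps` updates, so the LAM first adapts to the interface the world model already expects; after that, both parameters are updated from the same prediction loss.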

Yucen Wang, Fengming Zhang, De-Chuan Zhan, Li Zhao, Kaixin Wang, Jiang Bian • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Block-stacking | Video-CraftBench | Success Rate (Human) | 54.1 | 14 |
| Sequential Paper Folding | Video-CraftBench | Step 1 Success Rate | 83.5 | 14 |
| Video Generation | Video-CraftBench | SSIM | 66.8 | 14 |
| Video Prediction | LIBERO | PSNR | 25.85 | 9 |
| Video Prediction | OXE | PSNR | 22.57 | 5 |
| Video Prediction | Agibot | PSNR | 23.93 | 5 |
| Video Prediction | EgoCentric | PSNR | 23.69 | 5 |
| Video Prediction | RoboDesk | PSNR | 24.29 | 4 |
| Visual Planning | RoboDesk VP2 benchmark | Upright Block Success Rate | 35.33 | 3 |
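
For readers unfamiliar with the PSNR metric reported for the video prediction rows, the following is a minimal sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE), in decibels. It assumes 8-bit pixel values (`max_value=255.0`) and flat lists of pixels; how the listed results aggregate over frames or videos is not stated here, so the helper `psnr` is illustrative only.

```python
import math


def psnr(reference, prediction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    max_value is the largest possible pixel value (assumed 255 for 8-bit images).
    """
    if len(reference) != len(prediction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ref = [100, 100, 100, 100]
    pred = [110, 110, 110, 110]
    print(psnr(ref, pred))
```

Higher values mean the predicted frames are closer to the ground truth; identical inputs give an infinite PSNR because the mean squared error is zero.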
