Unified Video Action Model

About

A unified video and action model holds significant promise for robotics, where videos provide rich scene information for action prediction, and actions provide dynamics information for video prediction. However, effectively combining video generation and action prediction remains challenging, and current video generation-based methods struggle to match the performance of direct policy learning in action accuracy and inference speed. To bridge this gap, we introduce the Unified Video Action model (UVA), which jointly optimizes video and action predictions to achieve both high accuracy and efficient action inference. The key lies in learning a joint video-action latent representation and decoupling video-action decoding. The joint latent representation bridges the visual and action domains, effectively modeling the relationship between video and action sequences. Meanwhile, the decoupled decoding, powered by two lightweight diffusion heads, enables high-speed action inference by bypassing video generation during inference. Such a unified framework further enables versatile functionality through masked input training. By selectively masking actions or videos, a single model can tackle diverse tasks beyond policy learning, such as forward and inverse dynamics modeling and video generation. Via an extensive set of experiments, we demonstrate that UVA can serve as a general-purpose solution for a wide range of robotics tasks, such as policy learning, forward/inverse dynamics and video observation prediction, without compromising performance compared to methods tailored for specific applications. Results are best viewed on https://unified-video-action-model.github.io/.
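The masked input training described above can be illustrated with a small sketch. It is not the authors' implementation: the task table, the input names (`history_video`, `future_video`, `actions`), and the `prepare` helper are all hypothetical, and which inputs each task keeps visible is an assumption based on the abstract. The sketch only shows the idea that one model serves several tasks by hiding some inputs and running only the decoding head that the task needs.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical task table (an assumption for illustration, not taken from the paper's code):
# "visible" lists the inputs left unmasked, "heads" lists the decoders that run.
TASKS: Dict[str, Dict[str, Tuple[str, ...]]] = {
    "policy":           {"visible": ("history_video",), "heads": ("action",)},
    "video_generation": {"visible": ("history_video",), "heads": ("video",)},
    "forward_dynamics": {"visible": ("history_video", "actions"), "heads": ("video",)},
    "inverse_dynamics": {"visible": ("history_video", "future_video"), "heads": ("action",)},
}


def prepare(
    task: str, inputs: Dict[str, Sequence]
) -> Tuple[Dict[str, Optional[Sequence]], Tuple[str, ...]]:
    """Mask every input the task does not condition on and pick the heads to run.

    Masked inputs are replaced by None; a real model would substitute a
    learned mask token instead (assumption).
    """
    spec = TASKS[task]
    masked = {
        name: (value if name in spec["visible"] else None)
        for name, value in inputs.items()
    }
    return masked, spec["heads"]


if __name__ == "__main__":
    batch = {"history_video": [0, 1], "future_video": [2, 3], "actions": [0.1, 0.2]}
    for task in TASKS:
        masked, heads = prepare(task, batch)
        print(task, masked, heads)
```

In this sketch the policy task runs only the action head, which mirrors the abstract's point that decoupled decoding lets action inference bypass video generation.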

Shuang Li, Yihuai Gao, Dorsa Sadigh, Shuran Song • 2025

Related benchmarks

Task                  Dataset                                  Metric                 Result  Rank
Robot Manipulation    LIBERO                                   Goal Achievement       93.6    494
Robotic Manipulation  LIBERO-10                                Success Rate           90      21
Robotic Manipulation  Calvin ABC->D                            Task-1 Score           87.5    16
Robot Manipulation    LIBERO LONG (test)                       Success Rate           90      15
Kitchen manipulation  RoboCasa 24 kitchen manipulation tasks   Average Success Rate   50      12
PushT                 PushT Variant Original                   mIoU                   94      6
PushT                 PushT Variant Rand Light                 mIoU                   54      6
PushT                 PushT Variant Rand Color                 mIoU                   13      6
PushT                 Variant PushT Texture                    mIoU                   11      6
Robotic Manipulation  Libero90                                 Pick Success Rate      67.6    5
Showing 10 of 12 rows
