
Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

About

Recent advances in video diffusion models show promise for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing methods primarily focus on individual object motion and struggle to capture the multi-object interactions that are crucial in complex manipulation. This limitation arises from entangled features in overlapping regions, which leads to degraded visual fidelity. To address this, we present RoboMaster, a novel framework that models inter-object dynamics via a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages (pre-interaction, interaction, and post-interaction) and to model each stage using the dominant object: the robotic arm in the pre- and post-interaction stages, and the manipulated object during interaction. This design effectively alleviates the multi-object feature fusion issue in prior work. To further ensure subject semantic consistency across the video, we incorporate appearance- and shape-aware latent representations for objects. Extensive experiments on the challenging Bridge dataset, as well as the RLBench and SIMPLER benchmarks, demonstrate that our method establishes new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation. Project Page: https://fuxiao0719.github.io/projects/robomaster/
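
The three-stage decomposition described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name `collaborative_trajectory`, the 2D point representation, and the `start`/`end` frame indices marking the interaction window are all assumptions made for illustration. The sketch only shows the selection rule, in which each frame's control point is taken from the dominant object of its stage.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def collaborative_trajectory(
    arm: List[Point],
    obj: List[Point],
    start: int,
    end: int,
) -> List[Tuple[str, Point]]:
    """Build one trajectory from per-frame arm and object positions.

    Hypothetical helper: frames in [start, end) are treated as the
    interaction stage and follow the manipulated object; all other
    frames follow the robotic arm.
    """
    if len(arm) != len(obj):
        raise ValueError("arm and obj must have the same number of frames")
    if not 0 <= start <= end <= len(arm):
        raise ValueError("need 0 <= start <= end <= number of frames")

    merged: List[Tuple[str, Point]] = []
    for t, (arm_pt, obj_pt) in enumerate(zip(arm, obj)):
        if t < start:
            # Before contact, the arm is the dominant object.
            merged.append(("pre-interaction", arm_pt))
        elif t < end:
            # During contact, the manipulated object is dominant.
            merged.append(("interaction", obj_pt))
        else:
            # After release, the arm is dominant again.
            merged.append(("post-interaction", arm_pt))
    return merged
```

In this sketch only one subject drives the trajectory at any frame, which mirrors the abstract's motivation of avoiding fused features from several objects in overlapping regions. How RoboMaster actually encodes the trajectory and the stage boundaries is not specified in this summary.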

Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, Dahua Lin • 2025

Related benchmarks

Task | Dataset | Result | Rank
Video Generation | Short-horizon tasks (test) | Aesthetic Quality: 50.2 | 8
Human-to-Robot Video Translation | DexYCB | Motion Consistency: 2.6 | 4
Hand-object manipulation video generation | DexYCB | -- | 4
