
PocketDP3: Efficient Pocket-Scale 3D Visuomotor Policy

About

Recently, 3D vision-based diffusion policies have shown strong capability in learning complex robotic manipulation skills. However, a common architectural mismatch exists in these models: a tiny yet efficient point-cloud encoder is often paired with a massive decoder. Given a compact scene representation, we argue that this may lead to substantial parameter waste in the decoder. Motivated by this observation, we propose PocketDP3, a pocket-scale 3D diffusion policy that replaces the heavy conditional U-Net decoder used in prior methods with a lightweight Diffusion Mixer (DiM) built on MLP-Mixer blocks. This architecture enables efficient fusion across the temporal and channel dimensions, significantly reducing model size. Notably, without any additional consistency-distillation techniques, our method supports two-step inference without sacrificing performance, improving its practicality for real-time deployment. Across three simulation benchmarks (RoboTwin 2.0, Adroit, and MetaWorld), PocketDP3 achieves state-of-the-art performance with fewer than 1% of the parameters of prior methods, while also accelerating inference. Real-world experiments further demonstrate the method's practicality and transferability. Code will be released.
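To make the core idea concrete, the sketch below shows one MLP-Mixer block of the kind the Diffusion Mixer (DiM) decoder is built on: a token-mixing MLP fuses information across the temporal (action-horizon) axis, then a channel-mixing MLP fuses across feature channels, each with a residual connection. This is a minimal illustrative sketch in NumPy, not the authors' implementation; all names, sizes, and the ReLU activation are assumptions for demonstration.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the last axis (standard pre-norm in Mixer blocks)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, w1, w2):
    """Two-layer MLP with a ReLU nonlinearity (activation choice is illustrative)."""
    return np.maximum(x @ w1, 0.0) @ w2

def mixer_block(x, rng, hidden=64):
    """One Mixer block on a sequence x of shape (T, C):
    T timesteps (action horizon), C feature channels."""
    T, C = x.shape
    # Token (temporal) mixing: transpose so the MLP acts along the time axis.
    w1 = rng.standard_normal((T, hidden)) * 0.02
    w2 = rng.standard_normal((hidden, T)) * 0.02
    x = x + mlp(layer_norm(x).T, w1, w2).T
    # Channel mixing: MLP acts per timestep along the channel axis.
    w3 = rng.standard_normal((C, hidden)) * 0.02
    w4 = rng.standard_normal((hidden, C)) * 0.02
    x = x + mlp(layer_norm(x), w3, w4)
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))   # e.g. a 16-step action sequence with 32 channels
y = mixer_block(x, rng)
print(y.shape)  # (16, 32): shape-preserving, so blocks can be stacked
```

Because each block maps `(T, C)` to `(T, C)`, blocks stack into a decoder whose cost is dominated by small per-axis MLPs, which is how a Mixer-style decoder can stay far smaller than a conditional U-Net.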

Jinhao Zhang, Zhexuan Zhou, Huizhe Li, Yichen Lai, Wenlong Xia, Haoming Song, Youmin Gong, Jie Mei • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Robotic Manipulation | RoboTwin 2.0 | Pick Diverse Bottles Success Rate: 77 | 17 |
| Robotic Manipulation | Adroit and MetaWorld | Average Success Rate: 77.4 | 13 |
| 3D Visuomotor Policy Inference Efficiency | Adroit and MetaWorld | Params (M): 0.53 | 7 |
| Adjust Bottle | Real-world Experiments, 15 trials (test) | Success Rate: 46.7 | 2 |
| Place Object | Real-world Experiments, 15 trials (test) | Success Rate: 73.3 | 2 |
| Stack Blocks Two | Real-world Experiments, 15 trials (test) | Success Rate: 20 | 2 |
