PocketDP3: Efficient Pocket-Scale 3D Visuomotor Policy

About

Recently, 3D vision-based diffusion policies have shown strong capability in learning complex robotic manipulation skills. However, a common architectural mismatch exists in these models: a tiny yet efficient point-cloud encoder is often paired with a massive decoder. Given a compact scene representation, we argue that this may lead to substantial parameter waste in the decoder. Motivated by this observation, we propose PocketDP3, a pocket-scale 3D diffusion policy that replaces the heavy conditional U-Net decoder used in prior methods with a lightweight Diffusion Mixer (DiM) built on MLP-Mixer blocks. This architecture enables efficient fusion across temporal and channel dimensions, significantly reducing model size. Notably, without any additional consistency distillation techniques, our method supports two-step inference without sacrificing performance, improving practicality for real-time deployment. Across three simulation benchmarks--RoboTwin2.0, Adroit, and MetaWorld--PocketDP3 achieves state-of-the-art performance with fewer than 1% of the parameters of prior methods, while also accelerating inference. Real-world experiments further demonstrate the practicality and transferability of our method in real-world settings. Code will be released.

Jinhao Zhang, Zhexuan Zhou, Huizhe Li, Yichen Lai, Wenlong Xia, Haoming Song, Youmin Gong, Jie Mei• 2026

Related benchmarks

Task	Dataset	Result
Robotic Manipulation	RoboTwin 2.0	Average Success Rate71.6	100
Robotic Manipulation	Adroit and MetaWorld	Average Success Rate77.4	28
3D Visuomotor Policy Inference Efficiency	Adroit and MetaWorld	Params (M)0.53	7
Adjust Bottle	Real-world Experiments 15 trials (test)	Success Rate46.7	2
Place Object	Real-world Experiments 15 trials (test)	Success Rate73.3	2
Stack Blocks Two	Real-world Experiments 15 trials (test)	Success Rate20	2

Showing 6 of 6 rows

Other info

Follow for update

@wizwand_team Discord