
VolumeDP: Modeling Volumetric Representation for Manipulation Policy Learning

About

Imitation learning is a prominent paradigm for robotic manipulation. However, existing visual imitation methods map 2D image observations directly to 3D action outputs, imposing a 2D-3D mismatch that hinders spatial reasoning and degrades robustness. We present VolumeDP, a policy architecture that restores spatial alignment by explicitly reasoning in 3D. VolumeDP first lifts image features into a volumetric representation via cross-attention. It then selects task-relevant voxels with a learnable module and converts them into a compact set of spatial tokens, markedly reducing computation while preserving action-critical geometry. Finally, a multi-token decoder conditions on the entire token set to predict actions, avoiding the lossy aggregation that collapses multiple spatial tokens into a single descriptor. VolumeDP achieves a state-of-the-art average success rate of 88.8% on the LIBERO simulation benchmark, outperforming the strongest baseline by a substantial 14.8%. It also delivers large performance gains over prior methods on the ManiSkill and LIBERO-Plus benchmarks. Real-world experiments further demonstrate higher success rates and robust generalization to novel spatial layouts, camera viewpoints, and environment backgrounds. Code will be released.
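The three-stage pipeline in the abstract (lift to voxels via cross-attention, select task-relevant voxels into spatial tokens, decode actions from the full token set) can be sketched with plain numpy. This is a minimal illustration, not the released implementation: all sizes, the random scoring head, and the 7-DoF action dimension are assumptions, and the learned weights are stand-in random matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical sizes (not taken from the paper).
C = 32                # feature dimension
n_pix = 14 * 14       # flattened image patch features
n_vox = 8 * 8 * 8     # flattened voxel grid
k = 64                # spatial tokens kept after selection

img_feats = rng.standard_normal((n_pix, C))
vox_queries = rng.standard_normal((n_vox, C))  # learned voxel embeddings in the real model

# 1) Lift: each voxel query cross-attends over the 2D image features,
#    producing a volumetric feature grid.
attn = softmax(vox_queries @ img_feats.T / np.sqrt(C), axis=-1)
vox_feats = attn @ img_feats                   # (n_vox, C)

# 2) Select: a scoring head (random here, learnable in the paper) ranks
#    voxels; only the top-k become compact spatial tokens.
w_score = rng.standard_normal(C)
scores = vox_feats @ w_score
tokens = vox_feats[np.argsort(scores)[-k:]]    # (k, C)

# 3) Decode: an action query attends over ALL k tokens, rather than
#    pooling them into a single descriptor first.
action_query = rng.standard_normal((1, C))
ctx = softmax(action_query @ tokens.T / np.sqrt(C), axis=-1) @ tokens
w_out = rng.standard_normal((C, 7))            # e.g. a 7-DoF action head
action = ctx @ w_out

print(tokens.shape, action.shape)              # (64, 32) (1, 7)
```

The point of step 3 is that the decoder's attention weights over `tokens` are recomputed per query, so no spatial information is discarded before action prediction; a mean-pooled single descriptor would lose exactly that.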

Tianxing Zhou, Feiyang Xue, Zhangchen Ye, Tianyuan Yuan, Hang Zhao, Tao Jiang • 2026

Related benchmarks

Task                 | Dataset                                      | Result                       | Rank
Robotic Manipulation | LIBERO                                       | Spatial Success Rate: 90.7   | 314
Robotic Manipulation | LIBERO-Plus                                  | Average Score: 53            | 107
Robotic Manipulation | ManiSkill                                    | Poke Cube: 78                | 4
Robotic Manipulation | Real-World Robot Tasks (In-distribution)     | Place Bowl Success Rate: 85  | 2
Robotic Manipulation | Real-World Robot Tasks (Out-of-distribution) | Space Configuration: 65      | 2
