PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction

About

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fine-grained geometry and spatial grounding - capabilities that are essential for precise and robust manipulation in 3D environments. In this paper, we propose PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process. PointACT employs a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to densely attend to both local geometric detail and global scene structure. We evaluate PointACT on the LIBERO and RLBench benchmarks and systematically compare it against monolithic and dual-system VLA baselines, including variants augmented with point cloud inputs. PointACT achieves consistent improvements across both benchmarks, increasing success rates by 10% on the challenging RLBench-10Tasks suite over state-of-the-art pretrained VLAs, with even larger gains when the vision-language backbone is frozen and the action expert is trained from scratch. Extensive ablation studies demonstrate that tightly coupling hierarchical 3D geometry with pretrained 2D semantic representations is critical for robust and spatially grounded robot control. Our results also highlight the promise of pretrained 3D representations for 3D-aware VLA policies.

Shizhe Chen, Paul Pacaud, Cordelia Schmid • 2026
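
The abstract describes action tokens that attend to point cloud features at several scales. The sketch below is only an illustration of that general idea, not the authors' implementation: it uses single-head dot-product cross-attention without learned projections, processes the scales in a fixed order with residual updates, and leaves out the bottleneck window self-attention entirely. The function names, token counts, and feature dimension are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, keys_values):
    # Single-head scaled dot-product cross-attention.
    # Assumption: no learned query/key/value projections, for brevity.
    dim = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(dim)
    weights = softmax(scores, axis=-1)  # one distribution over points per token
    return weights @ keys_values


def multi_scale_interaction(action_tokens, point_feats_by_scale):
    # Hypothetical scheme: action tokens attend to each scale in turn
    # and are updated with a residual connection after every scale.
    tokens = action_tokens
    for feats in point_feats_by_scale:
        tokens = tokens + cross_attend(tokens, feats)
    return tokens


rng = np.random.default_rng(0)
dim = 16                                    # illustrative feature size
action_tokens = rng.normal(size=(8, dim))   # 8 action tokens
# Three scales of point features, coarse to fine (illustrative point counts).
point_feats_by_scale = [rng.normal(size=(n, dim)) for n in (32, 128, 512)]

out = multi_scale_interaction(action_tokens, point_feats_by_scale)
print(out.shape)  # (8, 16)
```

Each action token ends up with the same shape it started with, so a block like this could be stacked or interleaved with other layers; how PointACT actually orders, windows, and projects these interactions is specified in the paper, not here.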

Related benchmarks

Task | Dataset | Metric | Result | Rank
Robotic Manipulation | LIBERO | Spatial Success Rate | 97.4 | 527
Robotic Manipulation | RLBench 10 tasks | Take Umbrella Success Rate | 99 | 20
Put Grapes and Banana in Plates | UR5 robot dataset (test) | Success Rate | 4 | 4
Stack Yellow Cup Onto Pink Cup | UR5 robot dataset (test) | Success Rate | 70 | 4
Close Drawer | UR5 robot dataset (test) | Success Rate | 70 | 4
Open microwave | SO-100 | Success Rate | 80 | 3
Put Sock In Drawer | SO-100 | Success Rate | 9 | 3
Put Banana In Plate | SO-100 | Success Rate | 100 | 3
