PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction

About

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fine-grained geometry and spatial grounding, capabilities that are essential for precise and robust manipulation in 3D environments. In this paper, we propose PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process. PointACT employs a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to densely attend to both local geometric detail and global scene structure. We evaluate PointACT on the LIBERO and RLBench benchmarks and systematically compare it against monolithic and dual-system VLA baselines, including variants augmented with point cloud inputs. PointACT achieves consistent improvements across both benchmarks, increasing success rates by 10% on the challenging RLBench-10Tasks suite over state-of-the-art pretrained VLAs, with even larger gains when the vision-language backbone is frozen and the action expert is trained from scratch. Extensive ablation studies demonstrate that tightly coupling hierarchical 3D geometry with pretrained 2D semantic representations is critical for robust and spatially grounded robot control. Our results also highlight the promise of pretrained 3D representations for 3D-aware VLA policies.

Shizhe Chen, Paul Pacaud, Cordelia Schmid • 2026

Related benchmarks

Task	Dataset	Result
Robotic Manipulation	LIBERO	Spatial Success Rate: 97.4	570
Robotic Manipulation	RLBench 10 tasks	Take Umbrella Success Rate: 99	20
Put Grapes and Banana in Plates	UR5 robot dataset (test)	Success Rate: 4	4
Stack Yellow Cup Onto Pink Cup	UR5 robot dataset (test)	Success Rate: 70	4
Close Drawer	UR5 robot dataset (test)	Success Rate: 70	4
Open microwave	SO-100	Success Rate: 80	3
Put Sock In Drawer	SO-100	Success Rate: 9	3
Put Banana In Plate	SO-100	Success Rate: 100	3

