HSC-VLA: Hierarchical Scene-Clearing for Robust Bimanual Manipulation in Dense Clutter
About
Modern Vision-Language-Action (VLA) models often fail to follow instructions in high-density manipulation environments, where task-irrelevant visual clutter dilutes attention, corrupts grounding, and substantially degrades performance in complex long-horizon scenarios. To overcome the representation bottleneck of monolithic end-to-end architectures, we propose HSC-VLA, a hierarchical framework that decouples high-level visual-semantic reasoning from low-level, high-frequency sensorimotor execution through an explicit scene-clearing abstraction. HSC-VLA employs a high-level Brain to decompose long-horizon tasks and to generate task-specific scene masks that preserve task-relevant geometry while suppressing distractors. The filtered observations are then passed to a low-level Cerebellum, a diffusion-based policy that performs bimanual manipulation using only mask-filtered vision and proprioception. Extensive experiments on densely cluttered supermarket shelves show that HSC-VLA achieves 86.7% aggregate success under high-density clutter, surpassing the best monolithic baseline ($\pi_0$-Full FT at 34.3%) by 52.4 percentage points. HSC-VLA also performs well on long-horizon tasks, reaching 72% on clutter sorting and 66% on restocking, which indicates robustness and effective failure recovery in complex cluttered manipulation.
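The data flow described in the abstract (a scene mask suppresses distractors, and the low-level policy receives only the filtered image plus proprioception) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `apply_scene_mask` and `build_policy_input`, the binary mask format, and the choice to fill suppressed pixels with black are all assumptions made for illustration.

```python
# Illustrative sketch only; names and data layout are hypothetical,
# not taken from the HSC-VLA code base.

def apply_scene_mask(image, mask, fill=(0, 0, 0)):
    """Keep pixels where mask is truthy; replace all others with `fill`."""
    same_height = len(image) == len(mask)
    same_width = all(len(r) == len(m) for r, m in zip(image, mask))
    if not (same_height and same_width):
        raise ValueError("image and mask must have the same height and width")
    return [
        [pixel if keep else fill for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def build_policy_input(image, mask, proprio):
    """Bundle mask-filtered vision with proprioception for a low-level policy."""
    return {"vision": apply_scene_mask(image, mask), "proprio": list(proprio)}


# A 2x2 RGB image in which only the diagonal pixels are task-relevant.
image = [[(10, 20, 30), (200, 200, 200)],
         [(5, 5, 5), (90, 80, 70)]]
mask = [[1, 0],
        [0, 1]]

obs = build_policy_input(image, mask, proprio=[0.1, -0.4, 0.7])
print(obs["vision"])  # distractor pixels are replaced by (0, 0, 0)
```

In this sketch the policy never sees the unmasked image, which mirrors the abstract's statement that the Cerebellum uses only mask-filtered vision and proprioception.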
Related benchmarks
| Task | Dataset | Metric | Result (%) | Rank |
|---|---|---|---|---|
| Aggregate tasks | Low Density Clutter | Aggregate Score | 90.7 | 7 |
| Aggregate tasks | High Density Clutter | Aggregate Score | 86.7 | 7 |
| Bimanual Manipulation | Low Density Clutter | Success Rate @ 100 steps | 96 | 7 |
| Bimanual Manipulation | High Density Clutter | Success Rate @ 100 steps | 97 | 7 |
| Grasp | Low Density Clutter | Success Rate @ 300 steps | 92 | 7 |
| Grasp | High Density Clutter | Success Rate @ 300 steps | 85 | 7 |
| Place | Low Density Clutter | Success Rate @ 200 steps | 84 | 7 |
| Place | High Density Clutter | Success Rate @ 200 steps | 78 | 7 |
| Clutter sorting | Long-horizon manipulation Clutter sorting | Success Rate @ 50 steps | 72 | 2 |
| Restocking | Long-horizon manipulation Restocking | Success Rate @ 50 steps | 66 | 2 |
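
As a quick consistency check, the aggregate scores above match the unweighted mean of the three per-task success rates (bimanual manipulation, grasp, place) at each clutter density. The source does not state how the aggregate is computed, so the averaging rule below is an assumption that happens to reproduce the reported numbers. The sketch also shows that the 52.4 gap over the $\pi_0$-Full FT baseline is a difference in percentage points, not a relative gain.

```python
# Assumption: aggregate = unweighted mean of the three per-task success rates.
# The page reports the numbers but does not spell out this formula.

def mean(values):
    values = list(values)
    return sum(values) / len(values)


low_density = {"bimanual": 96, "grasp": 92, "place": 84}
high_density = {"bimanual": 97, "grasp": 85, "place": 78}

low_aggregate = round(mean(low_density.values()), 1)
high_aggregate = round(mean(high_density.values()), 1)

# Gap to the baseline quoted in the abstract (34.3% aggregate success).
gap_points = round(86.7 - 34.3, 1)     # absolute gap, in percentage points
relative_gain = round(86.7 / 34.3, 2)  # ratio of the two success rates

print(low_aggregate)   # 90.7
print(high_aggregate)  # 86.7
print(gap_points)      # 52.4
print(relative_gain)   # 2.53
```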