From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
About
Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, although built on top of general Vision-Language Models (VLMs), still fall short of robust zero-shot performance because embodied datasets are scarce and heterogeneous. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we validate FSD's capabilities in both "seeing" and "doing": it achieves outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on VABench, a more challenging benchmark that we propose. We also verify zero-shot capabilities in robot manipulation, demonstrating significant performance improvements over baseline methods in both SimplerEnv and real-robot settings. Experimental results show that FSD achieves a 40.6% success rate in SimplerEnv and a 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Spatial Reasoning | BLINK | Spa. Score | 78.3 | 26 |
| Embodied Spatial Point Reasoning | Where2Place | Accuracy | 45.81 | 19 |
| Visual Trace Generation | VABench VisualTrace | RMSE | 78.26 | 12 |
| Spatial Reasoning | SAT | Val Metric Score | 73.2 | 12 |
| Spatial Reasoning | EmbSp (test) | Test Accuracy | 63.3 | 12 |
| Robotic Manipulation | SimplerEnv WidowX Robot | Success Rate: Put Spoon on Towel | 41.6 | 12 |
| Spatial Reasoning | CRPE | Subject Accuracy | 75.2 | 12 |
| Robotic Manipulation | Simpler WidowX Simulation | Success Rate: Spoon on Towel | 41.7 | 12 |
| Spatial Reasoning | CVBench | Count Score | 62.4 | 12 |
| Object Referencing | RoboRefIt (test) | Accuracy | 56.7 | 8 |