From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

About

Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while built on top of general Vision-Language Models (VLMs), still fall short of robust zero-shot performance because embodied datasets are scarce and heterogeneous. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we validate FSD's capabilities in both "seeing" and "doing," achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on our proposed, more challenging benchmark, VABench. We also verify zero-shot capabilities in robot manipulation, demonstrating significant performance improvements over baseline methods in both SimplerEnv and real robot settings. Experimental results show that FSD achieves a 40.6% success rate in SimplerEnv and a 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.

Yifu Yuan, Haiqin Cui, Yibin Chen, Zibin Dong, Fei Ni, Longxin Kou, Jinyi Liu, Pengyi Li, Yan Zheng, Jianye Hao • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
--- | --- | --- | --- | ---
Spatial Reasoning | BLINK | Spa. Score | 78.3 | 26
Embodied Spatial Point Reasoning | Where2Place | Accuracy | 45.81 | 19
Visual Trace Generation | VABench VisualTrace | RMSE | 78.26 | 12
Spatial Reasoning | SAT | Val Metric Score | 73.2 | 12
Spatial Reasoning | EmbSp (test) | Test Accuracy | 63.3 | 12
Robotic Manipulation | SimplerEnv WidowX Robot | Success Rate: Put Spoon on Towel | 41.6 | 12
Spatial Reasoning | CRPE | Subject Accuracy | 75.2 | 12
Robotic Manipulation | Simpler WidowX Simulation | Success Rate: Spoon on Towel | 41.7 | 12
Spatial Reasoning | CVBench | Count Score | 62.4 | 12
Object Referencing | RoboRefIt (test) | Accuracy | 56.7 | 8
Showing 10 of 28 rows
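
The Visual Trace Generation row reports an RMSE. As a rough illustration of how such a score could be computed, the sketch below takes the root-mean-square of the point-wise Euclidean distance between a predicted and a reference 2D trace. This is an assumption for illustration only: the exact VABench protocol (coordinate normalization, resampling of traces to equal length, averaging across samples) is not stated on this page, and the function name `trace_rmse` is hypothetical.

```python
import math


def trace_rmse(pred, ref):
    """RMSE between two 2D point traces of equal length.

    Assumption: error is the point-wise Euclidean distance between
    corresponding (x, y) waypoints; this may differ from the benchmark's
    actual definition.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("traces must be non-empty and of equal length")
    # Squared Euclidean distance for each pair of corresponding waypoints.
    squared = [
        (px - rx) ** 2 + (py - ry) ** 2
        for (px, py), (rx, ry) in zip(pred, ref)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical traces in pixel coordinates.
predicted = [(0.0, 0.0), (3.0, 4.0)]
reference = [(0.0, 0.0), (0.0, 0.0)]
print(trace_rmse(predicted, reference))
```

Under this definition, lower is better and identical traces score 0.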
