
Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training

About

General-purpose robotic systems operating in open-world environments must achieve both broad generalization and high-precision action execution, a combination that remains challenging for existing Vision-Language-Action (VLA) models. While large Vision-Language Models (VLMs) improve semantic generalization, insufficient embodied reasoning leads to brittle behavior, and conversely, strong reasoning alone is inadequate without precise control. To provide a decoupled and quantitative assessment of this bottleneck, we introduce Embodied Reasoning Intelligence Quotient (ERIQ), a large-scale embodied reasoning benchmark in robotic manipulation, comprising 6K+ question-answer pairs across four reasoning dimensions. By decoupling reasoning from execution, ERIQ enables systematic evaluation and reveals a strong positive correlation between embodied reasoning capability and end-to-end VLA generalization. To bridge the gap from reasoning to precise execution, we propose FACT, a flow-matching-based action tokenizer that converts continuous control into discrete sequences while preserving high-fidelity trajectory reconstruction. The resulting GenieReasoner jointly optimizes reasoning and action in a unified space, outperforming both continuous-action and prior discrete-action baselines in real-world tasks. Together, ERIQ and FACT provide a principled framework for diagnosing and overcoming the reasoning-precision trade-off, advancing robust, general-purpose robotic manipulation. Project page: https://geniereasoner.github.io/GenieReasoner/
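To make the idea of converting continuous control into discrete sequences concrete, here is a minimal sketch of the simplest possible action tokenizer: uniform binning with bin-centre reconstruction. This is not FACT, which the abstract describes as flow-matching-based; the function names, the [-1, 1] action range, and the 256-bin vocabulary are all assumptions chosen purely for illustration. It only shows the trade-off that any discrete tokenizer has to manage, namely that reconstruction error is bounded by the quantization resolution.

```python
# Illustrative only: naive uniform-bin tokenizer, NOT the FACT tokenizer.
# The range [-1, 1] and n_bins=256 are assumed values for this sketch.

def tokenize(actions, low=-1.0, high=1.0, n_bins=256):
    """Map each continuous value to an integer bin index in [0, n_bins - 1]."""
    width = (high - low) / n_bins
    tokens = []
    for a in actions:
        clipped = min(max(a, low), high)  # keep the value inside the range
        index = int((clipped - low) / width)
        tokens.append(min(index, n_bins - 1))  # the upper edge maps to the last bin
    return tokens


def detokenize(tokens, low=-1.0, high=1.0, n_bins=256):
    """Map each bin index back to the centre of its bin."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


# A hypothetical one-dimensional trajectory of normalized control values.
trajectory = [-1.0, -0.25, 0.0, 0.5, 1.0]
tokens = tokenize(trajectory)
reconstructed = detokenize(tokens)

# The worst-case reconstruction error is at most half a bin width.
max_error = max(abs(a - r) for a, r in zip(trajectory, reconstructed))
print(tokens, max_error)
```

With uniform bins, halving the error requires doubling the vocabulary per action dimension, which is one reason a learned tokenizer that preserves trajectory fidelity is attractive for precise control.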

Yi Liu, Sukai Wang, Dafeng Wei, Xiaowei Cai, Linqing Zhong, Jiange Yang, Guanghui Ren, Jinyu Zhang, Maoqing Yao, Chuankang Li, Xindong He, Liliang Chen, Jianlan Luo • 2025

Related benchmarks

Task | Dataset | Result | Rank
Spatial Reasoning | EmbSpatial | Overall Accuracy: 70.66 | 63
Spatial Reasoning | CV-Bench | Accuracy: 83.89 | 61
Embodied AI reasoning | ERIQ 6K (full) | Action Understanding: 96.67 | 10
Spatial Reasoning | BLINK-R | Accuracy: 83.87 | 10
Spatial Perception and Grounding | ERIQ ER-6K | Scene Understanding: 84.18 | 10
Spatial Reasoning | BLINK-S | Accuracy: 74.83 | 10