Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training

About

General-purpose robotic systems operating in open-world environments must achieve both broad generalization and high-precision action execution, a combination that remains challenging for existing Vision-Language-Action (VLA) models. While large Vision-Language Models (VLMs) improve semantic generalization, insufficient embodied reasoning leads to brittle behavior, and conversely, strong reasoning alone is inadequate without precise control. To provide a decoupled and quantitative assessment of this bottleneck, we introduce Embodied Reasoning Intelligence Quotient (ERIQ), a large-scale embodied reasoning benchmark in robotic manipulation, comprising 6K+ question-answer pairs across four reasoning dimensions. By decoupling reasoning from execution, ERIQ enables systematic evaluation and reveals a strong positive correlation between embodied reasoning capability and end-to-end VLA generalization. To bridge the gap from reasoning to precise execution, we propose FACT, a flow-matching-based action tokenizer that converts continuous control into discrete sequences while preserving high-fidelity trajectory reconstruction. The resulting GenieReasoner jointly optimizes reasoning and action in a unified space, outperforming both continuous-action and prior discrete-action baselines in real-world tasks. Together, ERIQ and FACT provide a principled framework for diagnosing and overcoming the reasoning-precision trade-off, advancing robust, general-purpose robotic manipulation. Project page: https://geniereasoner.github.io/GenieReasoner/

Yi Liu, Sukai Wang, Dafeng Wei, Xiaowei Cai, Linqing Zhong, Jiange Yang, Guanghui Ren, Jinyu Zhang, Maoqing Yao, Chuankang Li, Xindong He, Liliang Chen, Jianlan Luo • 2025

Related benchmarks

Task                              Dataset          Metric                Result  Rank
Spatial Reasoning                 CV-Bench         Accuracy              83.89   46
Spatial Reasoning                 EmbSpatial       Overall Accuracy      70.66   30
Embodied AI reasoning             ERIQ 6K (full)   Action Understanding  96.67   10
Spatial Reasoning                 BLINK-R          Accuracy              83.87   10
Spatial Perception and Grounding  ERIQ ER-6K       Scene Understanding   84.18   10
Spatial Reasoning                 BLINK-S          Accuracy              74.83   10