Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training

About

General-purpose robotic systems operating in open-world environments must achieve both broad generalization and high-precision action execution, a combination that remains challenging for existing Vision-Language-Action (VLA) models. While large Vision-Language Models (VLMs) improve semantic generalization, insufficient embodied reasoning leads to brittle behavior, and conversely, strong reasoning alone is inadequate without precise control. To provide a decoupled and quantitative assessment of this bottleneck, we introduce Embodied Reasoning Intelligence Quotient (ERIQ), a large-scale embodied reasoning benchmark in robotic manipulation, comprising 6K+ question-answer pairs across four reasoning dimensions. By decoupling reasoning from execution, ERIQ enables systematic evaluation and reveals a strong positive correlation between embodied reasoning capability and end-to-end VLA generalization. To bridge the gap from reasoning to precise execution, we propose FACT, a flow-matching-based action tokenizer that converts continuous control into discrete sequences while preserving high-fidelity trajectory reconstruction. The resulting GenieReasoner jointly optimizes reasoning and action in a unified space, outperforming both continuous-action and prior discrete-action baselines in real-world tasks. Together, ERIQ and FACT provide a principled framework for diagnosing and overcoming the reasoning-precision trade-off, advancing robust, general-purpose robotic manipulation. Project page: https://geniereasoner.github.io/GenieReasoner/
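As a rough illustration of what "converting continuous control into discrete sequences" and "trajectory reconstruction" mean, the sketch below uses naive uniform binning. This is not FACT, which the abstract describes as flow-matching-based; it is only a simple baseline under assumed settings. The function names, the action range [-1, 1], and the 256-bin vocabulary are all hypothetical choices made for this example.

```python
import math

# Hypothetical settings for illustration only (not taken from the paper).
LOW, HIGH, N_BINS = -1.0, 1.0, 256
BIN_WIDTH = (HIGH - LOW) / N_BINS


def tokenize(actions):
    """Map each continuous action value to an integer bin index in [0, N_BINS - 1]."""
    tokens = []
    for a in actions:
        clipped = min(max(a, LOW), HIGH)  # clamp to the assumed action range
        idx = min(int((clipped - LOW) / BIN_WIDTH), N_BINS - 1)
        tokens.append(idx)
    return tokens


def detokenize(tokens):
    """Map each bin index back to the center of its bin."""
    return [LOW + (t + 0.5) * BIN_WIDTH for t in tokens]


# Toy 1-D trajectory standing in for a single action dimension.
trajectory = [math.sin(0.1 * t) for t in range(50)]
tokens = tokenize(trajectory)
reconstruction = detokenize(tokens)

# For in-range values, uniform binning bounds the error by half a bin width.
max_error = max(abs(a - b) for a, b in zip(trajectory, reconstruction))
print(f"max reconstruction error: {max_error:.5f} (bound: {BIN_WIDTH / 2:.5f})")
```

With uniform bins, reconstruction fidelity can only be improved by enlarging the vocabulary or spending more tokens per action, which is the kind of trade-off a learned tokenizer aims to improve on.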

Yi Liu, Sukai Wang, Dafeng Wei, Xiaowei Cai, Linqing Zhong, Jiange Yang, Guanghui Ren, Jinyu Zhang, Maoqing Yao, Chuankang Li, Xindong He, Liliang Chen, Jianlan Luo • 2025

Related benchmarks

Task	Dataset	Metric	Value	
Spatial Reasoning	EmbSpatial	Overall Accuracy	70.66	131
Spatial Reasoning	CV-Bench	Accuracy	83.89	89
Embodied AI reasoning	ERIQ 6K (full)	Action Understanding	96.67	10
Spatial Reasoning	BLINK-R	Accuracy	83.87	10
Spatial Perception and Grounding	ERIQ ER-6K	Scene Understanding	84.18	10
Spatial Reasoning	BLINK-S	Accuracy	74.83	10

