Extending Embodied Question Answering from Perception to Decision

About

Embodied Question Answering (EQA) connects perception, reasoning, and interaction within embodied environments. However, existing datasets and benchmarks remain fragmented, each focusing on a limited subset of reasoning skills such as spatial understanding or procedural reasoning, without offering a unified large-scale framework for comprehensive evaluation. We present EQA-Decision, a large-scale embodied QA dataset that systematically covers four complementary dimensions of embodied reasoning: static scene construction, spatial understanding, task dynamics reasoning, and instant decision. The dataset contains over four million question-answer pairs with hierarchical annotations across diverse embodied scenarios. In addition, we develop RoboDecision, a strong baseline model aligned with the EQA-Decision Benchmark, providing a unified framework that jointly evaluates perception, reasoning, and action-level decision-making in embodied environments. Results demonstrate that EQA-Decision effectively benchmarks and enhances VLM capabilities in spatial and interaction reasoning, providing a solid foundation for advancing embodied intelligence research.

Xicheng Gong, Qiwei Li, Peiran Xu, Yadong Mu • 2026

Related benchmarks

Task	Dataset	Result
Embodied Reasoning and Question Answering	ERQA	Score: 54.5	53
Spatial Reasoning	Where2Place	Score: 67.08	17
Embodied Question Answering and Decision Making	EQA-Decision Benchmark	Static Scene Accuracy: 81.55	8
Video Question Answering	RoboVQA	BLEU-1: 86.97	5
