Extending Embodied Question Answering from Perception to Decision

About

Embodied Question Answering (EQA) connects perception, reasoning, and interaction within embodied environments. However, existing datasets and benchmarks remain fragmented, each focusing on a limited subset of reasoning skills such as spatial understanding or procedural reasoning, without offering a unified large-scale framework for comprehensive evaluation. We present EQA-Decision, a large-scale embodied QA dataset that systematically covers four complementary dimensions of embodied reasoning: static scene construction, spatial understanding, task dynamics reasoning, and instant decision. The dataset contains over four million question-answer pairs with hierarchical annotations across diverse embodied scenarios. In addition, we develop RoboDecision, a strong baseline model aligned with the EQA-Decision Benchmark, providing a unified framework that jointly evaluates perception, reasoning, and action-level decision-making in embodied environments. Results demonstrate that EQA-Decision effectively benchmarks and enhances VLM capabilities in spatial and interaction reasoning, providing a solid foundation for advancing embodied intelligence research.

Xicheng Gong, Qiwei Li, Peiran Xu, Yadong Mu • 2026

Related benchmarks

Task                                             Dataset                 Result                       Rank
Embodied Reasoning and Question Answering        ERQA                    Score 54.5                   35
Spatial Reasoning                                Where2Place             Score 67.08                  17
Embodied Question Answering and Decision Making  EQA-Decision Benchmark  Static Scene Accuracy 81.55  8
Video Question Answering                         RoboVQA                 BLEU-1 86.97                 5
