BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections

About

Deploying embodied agents that can answer questions about their surroundings in real-world settings remains difficult, partly due to the scarcity of benchmarks for episodic memory Embodied Question Answering (EQA). Inspired by the challenges of infrastructure inspections, we propose Inspection EQA as a compelling problem class for advancing episodic memory EQA. It demands multi-scale reasoning and long-range spatial understanding, while offering standardized evaluation, professional inspection reports as grounding, and egocentric imagery. We introduce BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs (in the style of OpenEQA) grounded in professional inspection reports across 200 real-world bridge scenes, with 47.93 images per scene on average. We further propose a new EQA metric, Image Citation Relevance, to evaluate a model's ability to cite relevant images. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps. To address these gaps, we propose Embodied Memory Visual Reasoning (EMVR), which formulates the inspection EQA task as a Markov decision process. EMVR shows strong performance over the baselines. Code and dataset are available at https://drags99.github.io/bridge-eqa/

Subin Varghese, Joshua Gao, Asad Ur Rahman, Vedhus Hoskere • 2025

Related benchmarks

Task	Dataset	Result
Embodied Question Answering	BridgeEQA (test)	Image Citation Relevance: 88.9	15
Embodied Question Answering	BridgeEQA 1,100 QA pairs (test)	Answer Correctness: 64.8	15
Condition Rating	BridgeEQA (instances with < 30 images)	Exact Match Accuracy: 40.9	9
Condition Rating	BridgeEQA fewer than 30 images	Condition Rating Accuracy (±1): 81.8	9
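
The two Condition Rating rows report an exact-match accuracy and a ±1 accuracy. As a rough illustration of how such scores are commonly computed, the sketch below assumes integer condition ratings and counts a prediction as correct when it falls within a given tolerance of the reference rating. The function name `rating_accuracy` and the sample ratings are hypothetical; this is not the benchmark's own evaluation code.

```python
from typing import Sequence


def rating_accuracy(
    predicted: Sequence[int], reference: Sequence[int], tolerance: int = 0
) -> float:
    """Percentage of predictions within `tolerance` of the reference rating.

    tolerance=0 gives exact-match accuracy; tolerance=1 gives +/-1 accuracy.
    Assumes ratings are integers (an assumption, not stated in the source).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one rating pair is required")
    hits = sum(abs(p - r) <= tolerance for p, r in zip(predicted, reference))
    return 100.0 * hits / len(reference)


# Made-up ratings for illustration only; not taken from BridgeEQA.
reference = [7, 6, 5, 8, 4]
predicted = [7, 5, 5, 6, 4]

exact = rating_accuracy(predicted, reference)             # tolerance 0
within_one = rating_accuracy(predicted, reference, 1)     # tolerance 1
print(f"exact match: {exact:.1f}, within +/-1: {within_one:.1f}")
```

Because the ±1 criterion is strictly looser than exact match, the second score can never be lower than the first, which is consistent with the ordering of the two figures in the table.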
