SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning
About
Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces ungrounded Chain-of-Thought (CoT) reasoning with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset detects ungrounded steps in real time. Experimental results demonstrate that SAFE not only exposes critical flaws in existing benchmarks at train time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 percentage points while guaranteeing verifiable trajectories at inference time.
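As a rough illustration of the inference-time verification idea (not the paper's actual implementation), the sketch below represents each reasoning step as an entity triple and checks it against a toy knowledge graph, flagging any step whose triple is missing as ungrounded. All names here (`Step`, `verify_trajectory`, the sample triples) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """A single reasoning hop, grounded as a (subject, relation, object) triple."""
    subject: str
    relation: str
    object: str

# Toy knowledge graph: a set of grounded triples.
KG = {
    ("Alan Turing", "born_in", "London"),
    ("London", "capital_of", "United Kingdom"),
}

def verify_trajectory(steps: list[Step]) -> list[tuple[Step, str]]:
    """Label each step 'grounded' if its triple appears in the KG,
    otherwise flag it 'ungrounded' so the trajectory can be corrected."""
    results = []
    for step in steps:
        triple = (step.subject, step.relation, step.object)
        verdict = "grounded" if triple in KG else "ungrounded"
        results.append((step, verdict))
    return results

if __name__ == "__main__":
    trajectory = [
        Step("Alan Turing", "born_in", "London"),
        Step("London", "capital_of", "France"),  # flawed hop: not in the KG
    ]
    for step, verdict in verify_trajectory(trajectory):
        print(f"{step.subject} -{step.relation}-> {step.object}: {verdict}")
```

In practice the feedback model would score each step rather than do an exact set lookup, but the step-level grounding check is the core mechanism this toy version mirrors.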
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-hop Question Answering | HotpotQA | F1 Score | 77.4 | 294 |
| Multi-hop Question Answering | 2Wiki | Exact Match | 72.8 | 152 |