CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering

About

Retrieval-augmented large language models, when optimized with outcome-level rewards, can achieve strong answer accuracy on multi-hop questions. However, under noisy retrieval, models frequently suffer from "right-answer-wrong-reason failures": they may exploit spurious shortcuts or produce reasoning traces weakly grounded in the supporting evidence. Furthermore, the lack of structured output control prevents reliable auditing of the underlying reasoning quality. To address this, we propose CRAFT (Calibrated Reasoning with Answer-Faithful Traces), a reinforcement learning framework for the response generation stage of retrieval-augmented multi-hop question answering. CRAFT trains models to produce structured reasoning traces with configurable levels of auditability (e.g., by selectively retaining planning, evidence citation, or reasoning steps). Training combines two complementary forms of supervision: deterministic rewards enforce verifiable constraints, including format compliance, answer correctness, and citation-set validity, while a judge-based reward audits semantic faithfulness by evaluating reasoning consistency and evidence grounding. Experiments show that CRAFT improves both answer accuracy and reasoning faithfulness across model scales. Notably, semantic judge-based rewards improve answer accuracy rather than compromise it, enabling CRAFT (7B) to achieve performance competitive with strong closed-source models.
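The combination of deterministic and judge-based rewards described above can be illustrated with a minimal sketch. Nothing here is taken from the CRAFT implementation: the tag names (`plan`, `evidence`, `reasoning`, `answer`), the `[n]` citation syntax, the weights, and the `judge` callable are all hypothetical choices made for illustration, and "citation-set validity" is interpreted here simply as "every cited index refers to a retrieved passage".

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical trace sections; the source does not specify the exact format.
ALL_TAGS = ("plan", "evidence", "reasoning", "answer")


@dataclass
class RewardWeights:
    # Illustrative weights only, not values reported by the paper.
    format: float = 0.1
    answer: float = 0.5
    citation: float = 0.2
    judge: float = 0.2


def extract(tag: str, text: str) -> Optional[str]:
    """Return the content of <tag>...</tag>, or None if the section is missing."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return m.group(1).strip() if m else None


def format_reward(text: str, required: Sequence[str]) -> float:
    """1.0 if every required section is present, else 0.0."""
    return 1.0 if all(extract(t, text) is not None for t in required) else 0.0


def answer_reward(text: str, gold: str) -> float:
    """1.0 if the answer section matches the gold answer (case-insensitive)."""
    pred = extract("answer", text)
    return 1.0 if pred is not None and pred.lower() == gold.lower() else 0.0


def citation_reward(text: str, num_docs: int) -> float:
    """1.0 if at least one passage is cited and all cited indices exist."""
    evidence = extract("evidence", text)
    if evidence is None:
        return 0.0
    ids = [int(x) for x in re.findall(r"\[(\d+)\]", evidence)]
    return 1.0 if ids and all(1 <= i <= num_docs for i in ids) else 0.0


def total_reward(
    text: str,
    gold: str,
    num_docs: int,
    judge: Callable[[str], float],  # hypothetical: faithfulness score in [0, 1]
    required: Sequence[str] = ALL_TAGS,
    w: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of three deterministic checks and one judge score."""
    return (
        w.format * format_reward(text, required)
        + w.answer * answer_reward(text, gold)
        + w.citation * citation_reward(text, num_docs)
        + w.judge * judge(text)
    )


trace = (
    "<plan>find the capital</plan><evidence>[1] [3]</evidence>"
    "<reasoning>Passage 1 names the country; passage 3 names its capital.</reasoning>"
    "<answer>Paris</answer>"
)
score = total_reward(trace, "paris", num_docs=3, judge=lambda t: 0.5)
```

Passing a shorter `required` tuple (for example, dropping `"plan"`) mimics the configurable levels of auditability mentioned in the abstract, again only as an assumption about how such a setting might be exposed.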

Yu Liu, Wenxiao Zhang, Diandian Guo, Cong Cao, Fangfang Yuan, Qiang Sun, Yanbing Liu, Jin B. Hong, Zhiyuan Ma • 2026

Related benchmarks

Task	Dataset	Metric	Result
Multi-hop Question Answering	HotpotQA (test)	F1	78	311
Multi-hop Question Answering	2WikiMHQA	F1 Score	85.56	73
Multi-hop Question Answering	MuSiQue in-distribution	EM	56.91	17
Multi-hop Question Answering	HotpotQA In-Distribution	Exact Match (EM)	66.51	17
Multi-hop Question Answering	2WikiMHQA in-distribution	Exact Match (EM)	79.39	17
Multi-hop Question Answering	MuSiQue v1 (test)	Exact Match (EM)	56.25	17
Multi-hop Question Answering	2WikiMHQA (test)	EM	67.15	17
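
For readers unfamiliar with the EM and F1 metrics in the table, the sketch below shows the SQuAD-style convention commonly used for multi-hop QA benchmarks. It is an assumption that these benchmarks follow exactly this normalization; the listing does not say. The functions return values in [0, 1], whereas the table appears to report scores on a 0 to 100 scale.

```python
import re
import string
from collections import Counter

PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    p = normalize(pred).split()
    g = normalize(gold).split()
    if not p or not g:
        # If either side is empty, score 1.0 only when both are empty.
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = f1_score("Eiffel Tower in Paris", "Eiffel Tower")
```

In the second call the prediction contains both gold tokens plus two extra ones, so recall is 1.0 and precision is 0.5, which is why F1 can reward partially correct answers that EM scores as zero.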
