
CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering

About

Retrieval-augmented generation (RAG) is widely used to ground Large Language Models (LLMs) for multi-hop question answering. Recent work has mainly focused on improving answer accuracy via fine-tuning and structured or reinforcement-based optimization. However, reliable reasoning in response generation faces three challenges: 1) Reasoning collapse. Reasoning in multi-hop QA is inherently complex due to multi-hop composition and is further destabilized by noisy retrieval. 2) Reasoning-answer inconsistency. Due to the intrinsic uncertainty of LLM generation and exposure to evidence-distractor mixtures, models may produce correct answers that are not faithfully supported by their intermediate reasoning or evidence. 3) Loss of format control. Traditional chain-of-thought generation often deviates from required structured output formats, leading to incomplete or malformed structured content. To address these challenges, we propose CRAFT (Calibrated Reasoning with Answer-Faithful Traces), a Group Relative Policy Optimization (GRPO) based reinforcement learning framework that trains models to perform faithful reasoning during response generation. CRAFT employs dual reward mechanisms to optimize multi-hop reasoning: deterministic rewards ensure structural correctness, while judge-based rewards verify semantic faithfulness. This optimization framework supports controllable trace variants that enable systematic analysis of how structure and scale affect reasoning performance and faithfulness. Experiments on three multi-hop QA benchmarks show that CRAFT improves both answer accuracy and reasoning faithfulness across model scales, with the CRAFT 7B model achieving performance competitive with closed-source LLMs across multiple reasoning trace settings.
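
To make the dual-reward idea concrete, here is a minimal sketch of how a deterministic format reward and a judge-based faithfulness score might be combined and then turned into GRPO-style group-relative advantages. The abstract does not specify the trace format, the reward weights, or the judge, so the `<reasoning>`/`<answer>` tags, the `w_format`/`w_judge` weights, and every function name below are hypothetical. The judge score is passed in as a plain number in [0, 1] rather than computed by a model.

```python
import re
from statistics import mean, pstdev

# Hypothetical trace format (an assumption, not taken from the paper):
# a reasoning span followed by a final answer span.
TRACE_PATTERN = re.compile(
    r"\s*<reasoning>(.+?)</reasoning>\s*<answer>(.+?)</answer>\s*",
    re.DOTALL,
)


def format_reward(response: str) -> float:
    """Deterministic reward: 1.0 if the response matches the assumed format."""
    return 1.0 if TRACE_PATTERN.fullmatch(response) else 0.0


def combined_reward(response: str, judge_score: float,
                    w_format: float = 0.5, w_judge: float = 0.5) -> float:
    """Weighted sum of the structural reward and a judge score in [0, 1].

    The equal weights are illustrative placeholders.
    """
    return w_format * format_reward(response) + w_judge * judge_score


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two sampled responses for the same question, with made-up judge scores.
responses = [
    "<reasoning>A was born in B.</reasoning><answer>B</answer>",
    "B",
]
judge_scores = [1.0, 0.0]
rewards = [combined_reward(r, j) for r, j in zip(responses, judge_scores)]
advantages = group_relative_advantages(rewards)
```

In this toy group, the well-formed and judge-approved response receives a positive advantage and the malformed one a negative advantage. If every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.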

Yu Liu, Wenxiao Zhang, Cong Cao, Fangfang Yuan, Weizhuo Chen, Cheng Hu, Pin Xu, Yuling Yang, Kun Peng, Diandian Guo, Qiang Sun, Yanbing Liu, Jin B. Hong, Zhiyuan Ma • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multi-hop Question Answering | HotpotQA (test) | F1 | 78 | 198 |
| Multi-hop Question Answering | 2WikiMHQA | F1 | 85.56 | 55 |
| Multi-hop Question Answering | MuSiQue in-distribution | EM | 56.91 | 17 |
| Multi-hop Question Answering | HotpotQA In-Distribution | EM | 66.51 | 17 |
| Multi-hop Question Answering | 2WikiMHQA in-distribution | EM | 79.39 | 17 |
| Multi-hop Question Answering | MuSiQue v1 (test) | EM | 56.25 | 17 |
| Multi-hop Question Answering | 2WikiMHQA (test) | EM | 67.15 | 17 |
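
The benchmark results above are reported as Exact Match (EM) and F1. This page does not say how the scores were computed, so the sketch below assumes the common SQuAD-style convention: normalize both strings, then compare them exactly (EM) or by token overlap (F1). The function names are illustrative, and the functions return per-example values in [0, 1], whereas leaderboards typically report dataset averages as percentages.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `exact_match("The Eiffel Tower.", "eiffel tower")` counts as a match after normalization, while `f1_score("Paris, France", "Paris")` gives partial credit because only one of the two predicted tokens appears in the gold answer.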
