
BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs

About

Recent advances in Large Reasoning Models (LRMs) have shown impressive capabilities in mathematical and logical reasoning. However, current LRMs rarely admit ignorance or respond with "I don't know". Instead, they often produce incorrect answers while showing undue confidence, raising concerns about their factual reliability. In this work, we identify two pathological reasoning patterns characterized by overthinking that contribute to overconfident and incorrect answers: last-minute guessing and second-thought spiraling. To address these issues, we propose BARREL, a novel framework that promotes concise and boundary-aware factual reasoning. Our experiments show that BARREL training increases the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48%, while maintaining accuracy comparable to models fine-tuned on reasoning data generated by R1. These results demonstrate the promise of this pilot study for building more reliable and factual System 2 LRMs.

Junxiao Yang, Jinzhe Tu, Haoran Liu, Xiaoce Wang, Chujie Zheng, Zhexin Zhang, Shiyao Cui, Caishun Chen, Tiantian He, Hongning Wang, Yew-Soon Ong, Minlie Huang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Calibration | NQ | ECE | 0.561 | 55 |
| Question Answering | PopQA | Score | 35 | 50 |
| Calibration | WebQ | ECE | 41.43 | 31 |
| Calibration | SQuAD | ECE | 65.13 | 31 |
| Mathematical Reasoning | GSM8K | Accuracy | 21.08 | 29 |
| Knowledge Grounded Dialogue | WoW | F1 Score | 15.94 | 15 |
| Slot Filling | T-REx | Accuracy | 44.57 | 14 |
| Fact Verification | FEVER | Accuracy | 72.2 | 11 |
| Expected Calibration Error | SeaQA | ECE | 18.55 | 10 |
| Expected Calibration Error | TriQA | ECE | 13.78 | 10 |

Showing 10 of 17 rows
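Several of the benchmark rows above report Expected Calibration Error (ECE), which measures how far a model's stated confidence drifts from its actual accuracy. The source does not include the paper's evaluation code, so the following is a minimal sketch of the standard binned ECE computation; the function name and the ten-bin setting are illustrative choices, not taken from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the weighted average gap between mean confidence
    and accuracy inside each confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # half-open bins (lo, hi]; the first bin also includes 0.0
        mask = (confidences > lo) & (confidences <= hi)
        if i == 0:
            mask |= confidences == 0.0
        if not mask.any():
            continue
        acc = correct[mask].mean()    # accuracy within the bin
        conf = confidences[mask].mean()  # mean confidence within the bin
        ece += (mask.sum() / n) * abs(acc - conf)
    return ece
```

For example, a model that always answers with confidence 0.8 but is right only half the time gets an ECE of 0.3, the per-bin gap between 0.8 and 0.5. Lower is better, which is why smaller Result values in the ECE rows correspond to better ranks.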
