PiCSAR: Probabilistic Confidence Selection And Ranking for Reasoning Chains

About

Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. This joint log-likelihood naturally decomposes into two terms: reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, which explains the effectiveness of PiCSAR.
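
As a rough illustration of the scoring idea in the abstract, the sketch below ranks candidates by a joint log-likelihood split into a reasoning term and an answer term, following the chain rule log p(r, y | x) = log p(r | x) + log p(y | r, x). It is a minimal sketch under stated assumptions, not the authors' implementation: the Candidate container, the function names, and the toy log-probabilities are all hypothetical, and the per-token log-probabilities are assumed to have been obtained from the model beforehand.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    """One sampled generation (hypothetical container, not from the paper)."""
    answer: str
    reasoning_logprobs: Sequence[float]  # per-token log-probs of the reasoning chain
    answer_logprobs: Sequence[float]     # per-token log-probs of the final answer


def reasoning_confidence(c: Candidate) -> float:
    # log p(r | x): sum of token log-probs over the reasoning chain
    return sum(c.reasoning_logprobs)


def answer_confidence(c: Candidate) -> float:
    # log p(y | r, x): sum of token log-probs over the final answer
    return sum(c.answer_logprobs)


def joint_score(c: Candidate) -> float:
    # Chain rule: log p(r, y | x) = log p(r | x) + log p(y | r, x)
    return reasoning_confidence(c) + answer_confidence(c)


def select_best(candidates: List[Candidate]) -> Candidate:
    # Best-of-n selection: keep the candidate with the highest joint score
    return max(candidates, key=joint_score)


# Toy log-probabilities, made up purely for illustration
candidates = [
    Candidate("42", reasoning_logprobs=[-0.2, -0.1, -0.3], answer_logprobs=[-0.05]),
    Candidate("41", reasoning_logprobs=[-1.0, -0.8], answer_logprobs=[-0.9]),
]
best = select_best(candidates)
print(best.answer, joint_score(best))
```

Note that a raw sum of log-probabilities tends to be lower for longer chains; whether and how the paper normalizes the two terms is not stated in this abstract, so the unnormalized sum here is only an assumption.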

Joshua Ong Jun Leang, Zheng Zhao, Aryo Pradipta Gema, Sohee Yang, Wai-Chung Kwan, Xuanli He, Wenda Li, Pasquale Minervini, Eleonora Giunchiglia, Shay B. Cohen • 2025

Related benchmarks

Task                                 Dataset        Metric               Result   Rank
Mathematical Reasoning               AIME 2024      Accuracy             81.33    479
Mathematical Reasoning               AIME 2025      Accuracy             68.89    311
Mathematical Reasoning               MATH 500       Accuracy             80.6     221
Mathematical Reasoning               GSM8K          Accuracy             95.94    108
Scientific Reasoning                 TheoremQA      Accuracy             42.57    68
Scientific Reasoning                 GPQA Diamond   Accuracy             59.43    62
Mathematical Reasoning Verification  MATH 500       Accuracy (MATH 500)  74       3
