
Deep Think with Confidence

About

Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishing returns in accuracy and high computational overhead. To address these challenges, we introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that enhances both reasoning efficiency and performance at test time. DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks. We evaluate DeepConf across a variety of reasoning tasks and the latest open-source models, including Qwen 3 and GPT-OSS series. Notably, on challenging benchmarks such as AIME 2025, DeepConf@512 achieves up to 99.9% accuracy and reduces generated tokens by up to 84.7% compared to full parallel thinking.
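The filtering idea in the abstract can be illustrated with a small sketch of the "after generation" case. This is not the authors' implementation: the `Trace` type, the `confidence` field, and the `keep_fraction` parameter are hypothetical names introduced here, and the abstract does not say how confidence is computed or how traces are selected. The sketch simply assumes each finished trace carries one scalar confidence score, keeps the most confident traces, and takes a confidence-weighted vote over their answers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trace:
    answer: str        # final answer extracted from one reasoning trace
    confidence: float  # hypothetical per-trace score (higher = more confident)


def filtered_vote(traces, keep_fraction=0.5):
    """Keep the most confident traces, then take a confidence-weighted vote.

    keep_fraction is an illustrative knob, not a value taken from the paper.
    """
    if not traces:
        raise ValueError("need at least one trace")
    ranked = sorted(traces, key=lambda t: t.confidence, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    votes = defaultdict(float)
    for t in ranked[:n_keep]:
        votes[t.answer] += t.confidence
    return max(votes, key=votes.get)


def majority_vote(traces):
    """Plain self-consistency baseline: every trace counts equally."""
    equal = [Trace(t.answer, 1.0) for t in traces]
    return filtered_vote(equal, keep_fraction=1.0)


if __name__ == "__main__":
    traces = [
        Trace("42", 0.9),
        Trace("42", 0.8),
        Trace("17", 0.3),
        Trace("17", 0.2),
        Trace("17", 0.25),
    ]
    print(majority_vote(traces))       # three low-confidence traces outvote two
    print(filtered_vote(traces, 0.5))  # only the two most confident traces vote
```

In this toy input, plain majority voting picks the answer backed by three low-confidence traces, while the filtered vote picks the answer backed by the two high-confidence ones. Filtering during generation, which the abstract also mentions, would additionally need a rule for stopping low-confidence traces early and is not shown here.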

Yichao Fu, Xuewei Wang, Yuandong Tian, Jiawei Zhao • 2025

Related benchmarks

Task                                  Dataset                Result               Rank
Mathematical Reasoning                GSM8K                  Accuracy 83.3        499
Mathematical Reasoning                AIME 25                Accuracy 87.4        201
Mathematical Reasoning                AIME24                 Accuracy 94.7        160
Visual Grounded Reasoning             TreeBench              Overall Score 49.9   128
Multimodal Reasoning                  LogicVista             Accuracy 56          99
Mathematical Reasoning                HMMT25                 Accuracy 86.7        95
Mathematical Reasoning                AIME 24                AIME 24 Accuracy 92  84
Mathematical Reasoning                MathVista mini (test)  Accuracy 70.7        75
High-resolution Visual Understanding  HR-Bench-8K            FSP 93               73
Mathematical Reasoning                HMMT 2025              Accuracy 73.8        70
Showing 10 of 78 rows
