Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

About

Self-evolving large language models (LLMs) learn by generating their own training tasks and solutions, reducing reliance on human-curated supervision. However, in many reasoning domains, the model must also validate generated tasks and judge generated answers to obtain training signals. This creates a training-signal challenge: erroneous self-judgments become erroneous gradient updates. Existing approaches either rely on external verifiers, which limits generality, or treat noisy self-generated feedback as supervision. We propose COSE (Confidence-Orchestrated Self-Evolution), which uses the LLM's intrinsic confidence as a lightweight uncertainty signal to modulate learning. COSE introduces confidence-weighted PPO updates and confidence-prioritized replay. Across 19 held-out benchmarks and four Qwen/Llama backbones (0.6B to 4B parameters), COSE consistently improves over base models and achieves the best average performance in general reasoning and mathematics, while remaining competitive on code. Code and data are available at https://anonymous.4open.science/r/COSE_-B5C2.
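
The abstract names two mechanisms, confidence-weighted PPO updates and confidence-prioritized replay, without giving their formulas. The Python sketch below is one plausible reading, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the use of geometric-mean token probability as the confidence score, the choice to scale each clipped PPO term by that score, and the choice to sample replay items in proportion to confidence (the paper may prioritize differently).

```python
import math
import random


def sequence_confidence(token_logprobs):
    """Assumed confidence score: geometric-mean token probability, in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def confidence_weighted_ppo_loss(ratios, advantages, confidences, clip_eps=0.2):
    """Clipped PPO surrogate with each sample's term scaled by its confidence.

    Assumption: low-confidence self-judgments contribute proportionally
    smaller updates. With all confidences equal to 1 this reduces to the
    standard clipped PPO objective averaged over samples.
    """
    total = 0.0
    for r, a, c in zip(ratios, advantages, confidences):
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, r))
        surrogate = min(r * a, clipped * a)
        total += c * surrogate
    return -total / len(ratios)


def sample_replay(buffer, k, rng=random):
    """Draw k items with probability proportional to stored confidence.

    Assumption: each buffer item is a dict with a 'confidence' key, and
    higher confidence means higher replay priority.
    """
    weights = [item["confidence"] for item in buffer]
    return rng.choices(buffer, weights=weights, k=k)
```

Under these assumptions, a sample judged with confidence 0.5 contributes half the gradient signal of a fully confident one, and a buffer item with confidence 0 is never replayed.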

Bowen Wei, Nan Wang, Yuqing Zhou, Jinhao Pan, Ziwei Zhu • 2026

Related benchmarks

Task                     Dataset        Metric           Result  Rank
General Reasoning        BBH            Accuracy         91.85   190
Mathematical Reasoning   Minerva        Accuracy         89.16   146
Mathematical Reasoning   GSM8K          Accuracy         94.4    95
General Reasoning        GPQA           Accuracy         84.85   59
Mathematics              OlympiadBench  Pass@1 Accuracy  78.72   51
Reasoning                OBQA           Accuracy         97.67   46
Math                     Minerva        Accuracy         89.16   40
Code Generation          MBPP+          Pass@1           67.5    40
General Reasoning        ARC-C          Accuracy         97.4    35
Code Generation          HumanEval+     Pass@1           73.8    34
Showing 10 of 18 rows
