Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

About

Self-evolving large language models (LLMs) learn by generating their own training tasks and solutions, reducing reliance on human-curated supervision. However, in many reasoning domains, the model must also validate generated tasks and judge generated answers to obtain training signals. This creates a training-signal challenge: erroneous self-judgments become erroneous gradient updates. Existing approaches either rely on external verifiers, which limits generality, or treat noisy self-generated feedback as supervision. We propose COSE (Confidence-Orchestrated Self-Evolution), which uses the LLM's intrinsic confidence as a lightweight uncertainty signal to modulate learning. COSE introduces confidence-weighted PPO updates and confidence-prioritized replay. Across 19 held-out benchmarks and four Qwen/Llama backbones (0.6B–4B), COSE consistently improves over base models and achieves the best average performance in general reasoning and mathematics, while remaining competitive on code. Code and data are available at https://anonymous.4open.science/r/COSE_-B5C2.
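
The two mechanisms named in the abstract, confidence-weighted PPO updates and confidence-prioritized replay, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define how confidence is computed, how it enters the loss, or which direction the replay priority runs. The sketch assumes confidence is the geometric-mean token probability of a generation, that it scales each sample's clipped surrogate term, and that replay items are drawn with probability proportional to confidence. All function names are hypothetical.

```python
import math
import random


def sequence_confidence(token_logprobs):
    """Assumed confidence proxy: geometric-mean token probability, in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def confidence_weighted_ppo_loss(ratios, advantages, confidences, clip_eps=0.2):
    """Clipped PPO surrogate with each sample's term scaled by its confidence.

    ratios:      new-policy / old-policy probability ratios, one per sample
    advantages:  advantage estimates, one per sample
    confidences: weights in [0, 1]; a weight of 0 removes the sample's signal
    Returns the confidence-weighted mean of the negated clipped surrogate.
    """
    total = 0.0
    for r, a, c in zip(ratios, advantages, confidences):
        clipped = max(1.0 - clip_eps, min(r, 1.0 + clip_eps))
        surrogate = min(r * a, clipped * a)  # standard PPO pessimistic bound
        total += -c * surrogate
    # Guard against an all-zero confidence batch.
    return total / max(sum(confidences), 1e-8)


def sample_replay(buffer, k, rng):
    """Draw k items with probability proportional to stored confidence.

    Assumption: higher-confidence items are replayed more often; the
    source does not state the direction of the prioritization.
    """
    weights = [item["confidence"] for item in buffer]
    return rng.choices(buffer, weights=weights, k=k)
```

Under these assumptions, a self-judged sample with near-zero confidence contributes almost nothing to the gradient, which is one way uncertain feedback could be kept from turning directly into erroneous updates.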

Bowen Wei, Nan Wang, Yuqing Zhou, Jinhao Pan, Ziwei Zhu • 2026

Related benchmarks

Task	Dataset	Result
General Reasoning	BBH	Accuracy 91.85	190
Mathematical Reasoning	Minerva	Accuracy (Acc) 89.16	146
Mathematical Reasoning	GSM8K	Accuracy 94.4	98
General Reasoning	GPQA	Accuracy 84.85	59
Mathematics	OlympiadBench	Pass@1 Accuracy 78.72	51
Code Generation	MBPP+	Pass@1 67.5	51
Reasoning	OBQA	Accuracy 97.67	46
Math	Minerva	Accuracy 89.16	40
General Reasoning	ARC-C	Accuracy 97.4	35
Code Generation	HumanEval+	Pass@1 73.8	34

Showing 10 of 18 rows
