Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback
About
Self-evolving large language models (LLMs) learn by generating their own training tasks and solutions, reducing reliance on human-curated supervision. However, in many reasoning domains, the model must also validate generated tasks and judge generated answers to obtain training signals. This creates a training-signal challenge: erroneous self-judgments become erroneous gradient updates. Existing approaches either rely on external verifiers, which limits generality, or treat noisy self-generated feedback as supervision. We propose COSE (Confidence-Orchestrated Self-Evolution), which uses the LLM's intrinsic confidence as a lightweight uncertainty signal to modulate learning. COSE introduces confidence-weighted PPO updates and confidence-prioritized replay. Across 19 held-out benchmarks and four Qwen/Llama backbones (0.6B to 4B), COSE consistently improves over base models and achieves the best average performance in general reasoning and mathematics, while remaining competitive on code. Code and data are available at https://anonymous.4open.science/r/COSE_-B5C2.
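The two mechanisms named above, confidence-weighted PPO updates and confidence-prioritized replay, can be illustrated with a minimal sketch. The abstract does not give COSE's actual formulas, so everything below is an assumption made for illustration: confidence is taken to be the geometric-mean token probability of a generated judgment, the weight multiplies the standard PPO clipped surrogate for a single sample, and replay samples items in proportion to stored confidence. The function names (`sequence_confidence`, `weighted_clipped_objective`, `sample_replay`) are hypothetical and are not taken from the released code.

```python
import math
import random


def sequence_confidence(token_logprobs):
    """Assumed confidence proxy: geometric-mean token probability, in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def weighted_clipped_objective(ratio, advantage, confidence, clip_eps=0.2):
    """Standard PPO clipped surrogate for one sample, scaled by confidence.

    Scaling by confidence is an illustrative choice: low-confidence
    self-judgments contribute smaller updates.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    return confidence * surrogate


def sample_replay(buffer, k, rng):
    """Draw k items with probability proportional to stored confidence.

    Favoring high-confidence items is an assumption; the opposite
    priority would be equally easy to express.
    """
    weights = [item["confidence"] for item in buffer]
    return rng.choices(buffer, weights=weights, k=k)


if __name__ == "__main__":
    conf = sequence_confidence([math.log(0.9), math.log(0.8), math.log(0.95)])
    print("confidence:", round(conf, 3))
    print("objective:", weighted_clipped_objective(1.5, 1.0, conf))
    buffer = [
        {"task": "a", "confidence": 0.2},
        {"task": "b", "confidence": 0.9},
    ]
    print([item["task"] for item in sample_replay(buffer, 5, random.Random(0))])
```

Under these assumptions, a self-judgment the model is unsure about still produces a gradient, but a proportionally smaller one, and it is revisited less often during replay.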
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| General Reasoning | BBH | Accuracy | 91.85 | 190 |
| Mathematical Reasoning | Minerva | Accuracy (Acc) | 89.16 | 146 |
| Mathematical Reasoning | GSM8K | Accuracy | 94.4 | 95 |
| General Reasoning | GPQA | Accuracy | 84.85 | 59 |
| Mathematics | OlympiadBench | Pass@1 Accuracy | 78.72 | 51 |
| Reasoning | OBQA | Accuracy | 97.67 | 46 |
| Math | Minerva | Accuracy | 89.16 | 40 |
| Code Generation | MBPP+ | Pass@1 | 67.5 | 40 |
| General Reasoning | ARC-C | Accuracy | 97.4 | 35 |
| Code Generation | HumanEval+ | Pass@1 | 73.8 | 34 |