
Guided Self-Evolving LLMs with Minimal Human Supervision

About

AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.
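The Challenger-Solver loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the Challenger and Solver are real LLMs in R-Few, but are stubbed here as plain functions, and the difficulty scores, pool contents, and function names are all illustrative assumptions.

```python
import random

# Hypothetical pool of human-labeled examples (stand-in for real data).
HUMAN_POOL = [
    {"question": "2 + 3 = ?", "answer": "5"},
    {"question": "7 * 6 = ?", "answer": "42"},
    {"question": "12 - 9 = ?", "answer": "3"},
]

def challenger(grounding, n=4, seed=0):
    """Generate synthetic questions conditioned on a few human examples
    (in-context grounding). Stub: perturbs the grounding questions and
    attaches a random stand-in difficulty score in [0, 1)."""
    rng = random.Random(seed)
    synthetic = []
    for i in range(n):
        base = rng.choice(grounding)
        synthetic.append({"question": f"[variant {i}] {base['question']}",
                          "difficulty": rng.random()})
    return synthetic

def curriculum_filter(synthetic, lo, hi):
    """Online difficulty-based curriculum: keep only questions whose
    estimated difficulty falls inside a target band."""
    return [q for q in synthetic if lo <= q["difficulty"] <= hi]

def rfew_iteration(human_pool, k=2, band=(0.2, 0.8), seed=0):
    """One iteration of the guided self-play loop (sketch)."""
    rng = random.Random(seed)
    grounding = rng.sample(human_pool, k)          # lightweight human oversight
    synthetic = challenger(grounding, seed=seed)   # grounded question generation
    kept = curriculum_filter(synthetic, *band)     # curriculum selection
    mixed_batch = grounding + kept                 # mixed human + synthetic training
    return mixed_batch

batch = rfew_iteration(HUMAN_POOL)
print(len(batch))  # at least the k human examples, plus in-band synthetic ones
```

The key structural point the sketch captures is that human data enters twice per iteration: as in-context grounding for the Challenger and as a fixed component of the Solver's training batch, which is what keeps the loop anchored against drift.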

Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang, Haitao Mi, Dong Yu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | AMC | Accuracy 72.3 | 151 |
| Mathematical Reasoning | Minerva | -- | 138 |
| Mathematical Reasoning | Olympiad | Accuracy 46.4 | 92 |
| General Reasoning | MMLU-Pro | Avg@8 Accuracy 63.2 | 51 |
| Mathematical Reasoning | Mathematical Reasoning Benchmarks (GSM8K, MATH, AMC23, Olympiad, Minerva) (test) | GSM8K Accuracy 94 | 32 |
| Reasoning | GPQA D | Accuracy 46.5 | 29 |
| Reasoning | Reasoning Benchmark Suite Aggregate | Average Score 56.7 | 26 |
| General Reasoning | General Reasoning Suite (MMLU Pro, SuperGPQA, GPQA Diamond, BBEH) | MMLU Pro 62.8 | 19 |
| General Reasoning | BBEH | Accuracy 12.3 | 19 |
| General Reasoning | SuperGPQA | -- | 16 |
