Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
About
Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing self-improvement approaches primarily rely on self-confirmation signals (e.g., confidence, entropy, or consistency) to generate rewards. This reliance drives models toward over-confident, majority-favored solutions, causing an entropy collapse that degrades pass@n and reasoning complexity. To address this, we propose EVOL-RL, a label-free framework that mirrors the evolutionary principle of balancing selection with variation. Concretely, EVOL-RL retains the majority-voted answer as an anchor for stability, but adds a novelty-aware reward that scores each sampled solution by how different its reasoning is from other concurrently generated responses. This majority-for-stability + novelty-for-exploration rule mirrors the variation-selection principle: selection prevents drift, while novelty prevents collapse. Evaluation results show that EVOL-RL consistently outperforms the majority-only baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from baseline's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents in-domain diversity collapse but also improves out-of-domain generalization (from math reasoning to broader tasks, e.g., MMLU-Pro and BBEH). The code is available at: https://github.com/YujunZhou/EVOL-RL.
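The abstract's reward rule (keep the majority-voted answer as a stability anchor, then add a novelty bonus for reasoning that differs from other concurrent samples) can be sketched as follows. This is a hypothetical illustration, not the paper's exact formulation: the function name `evol_rl_rewards`, the `alpha` weight, and the precomputed pairwise-similarity matrix are all assumptions for the sketch.

```python
from collections import Counter

def evol_rl_rewards(answers, similarities, alpha=0.5):
    """Hedged sketch of a majority-for-stability + novelty-for-exploration reward.

    answers: final answer extracted from each sampled solution
    similarities: similarities[i][j] in [0, 1], similarity of the reasoning
        traces of samples i and j (e.g., from an embedding model); assumed
        precomputed, symmetric
    alpha: hypothetical weight on the novelty term
    """
    # Selection: the majority-voted answer acts as the stability anchor.
    majority, _ = Counter(answers).most_common(1)[0]
    n = len(answers)
    rewards = []
    for i in range(n):
        base = 1.0 if answers[i] == majority else 0.0
        # Variation: novelty = average dissimilarity to the other
        # concurrently generated responses.
        others = [similarities[i][j] for j in range(n) if j != i]
        novelty = 1.0 - sum(others) / len(others) if others else 0.0
        rewards.append(base + alpha * novelty)
    return majority, rewards
```

Under this sketch, a sample that agrees with the majority *and* reasons differently from its peers scores highest, which is the intended selection-plus-variation balance; the actual reward shaping in the paper may differ.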
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| General Reasoning | MMLU | Accuracy | 77.9 | 156 |
| Mathematical Reasoning | AMC | Pass@1 | 69.62 | 112 |
| Mathematical Reasoning | AIME 2025 | Pass@1 | 30.34 | 96 |
| Mathematical Reasoning | AIME 2024 | Pass@1 | 41.22 | 86 |
| Mathematical Reasoning | MATH 500 | Mean@1 | 0.734 | 55 |
| General Reasoning | GPQA | Accuracy | 30.3 | 36 |
| Reasoning | MMLU-Pro | Pass@1 | 55.3 | 27 |
| General Reasoning | GPQA | Pass@1 | 45.2 | 26 |
| Mathematical Reasoning | AMC | Mean Accuracy | 55 | 24 |
| Mathematical Reasoning | AIME 2024 | Mean Accuracy | 26.3 | 24 |