On the Generalization Gap in Self-Evolving Language Model Reasoning

About

Recent work suggests that large language models (LLMs) can improve through self-evolution (SE), using supervision signals generated by the model itself. In this work, we ask: under a strict closed-loop setup, where the self-evolution algorithm has access only to an unlabeled prompt set and a base model, how close can internally generated supervision come to oracle-supervised training? We analyze four representative strategies in a unified offline self-evolution framework: single-round verification, multi-turn revision with feedback, iterative training, and curriculum learning. Our primary experiments use Knights and Knaves (KK) logical reasoning tasks, which provide deterministic solutions, controlled difficulty levels, and a clean testbed for easy-to-hard generalization. We first show that self-evolution consistently improves over the base model, but it plateaus as more training compute is invested and still leaves a non-trivial gap to oracle supervision. We find that multi-turn critic-revision with large models can reach strong self-evolution performance, with Gemma 12B nearly matching oracle-supervised training. Beyond Knights and Knaves, we also evaluate self-evolution on real-world reasoning benchmarks, where gains are modest. Overall, our results characterize when closed-loop self-evolution can help and show that internally generated supervision remains insufficient under this minimal formulation.
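To make concrete why Knights and Knaves puzzles have deterministic solutions, the following is a minimal sketch of a brute-force checker. It is not the authors' code: the puzzle encoding, the `solve_kk` function, and the two-person example are all hypothetical and chosen only for illustration. It assumes the standard rules, where a knight always tells the truth and a knave always lies.

```python
from itertools import product
from typing import Callable, Dict, List

# An assignment maps each person to True (knight) or False (knave).
Assignment = Dict[str, bool]
# A statement is a predicate over an assignment (hypothetical encoding).
Statement = Callable[[Assignment], bool]


def solve_kk(people: List[str], statements: Dict[str, Statement]) -> List[Assignment]:
    """Return every role assignment consistent with all statements."""
    solutions = []
    # Enumerate all 2^n knight/knave assignments.
    for roles in product([True, False], repeat=len(people)):
        world = dict(zip(people, roles))
        # A knight's statement must be true; a knave's must be false.
        if all(world[speaker] == claim(world) for speaker, claim in statements.items()):
            solutions.append(world)
    return solutions


# Hypothetical two-person puzzle (not taken from the paper):
#   A says: "B is a knave."
#   B says: "A and B are of the same kind."
puzzle = {
    "A": lambda w: not w["B"],
    "B": lambda w: w["A"] == w["B"],
}
solutions = solve_kk(["A", "B"], puzzle)
print(solutions)  # [{'A': True, 'B': False}]
```

Because the search space is exactly 2^n assignments for n people, a proposed answer can be checked exactly, and the number of people gives a natural difficulty knob of the kind the abstract describes.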

Zhenting Qi, Susanna Maria Baby, Stefanie Anna Baby, Kan Yuan, Andrew Tomkins, Tu Vu, Da-Cheng Juan, Cyrus Rashtchian • 2026

Related benchmarks

Task                    Dataset       Result               Rank
Mathematical Reasoning  MATH 500      Accuracy 77.4        221
Mathematical Reasoning  TabMWP        Accuracy 92.3        203
Mathematical Reasoning  MATH Hard     Accuracy 55.1        198
Logical reasoning       KK            Test Accuracy 33.2   28
Logical reasoning       KK 6–8 ppl.   Accuracy 27.5        21
Logical reasoning       KK 2–3 ppl.   Accuracy 84.8        21
Logical reasoning       KK 4–5 ppl.   Accuracy 58.7        21
Logical reasoning       KK All        Accuracy 52.8        21