On the Generalization Gap in Self-Evolving Language Model Reasoning

About

Recent work suggests that large language models (LLMs) can improve through self-evolution (SE), using supervision signals generated by the model itself. In this work, we ask: under a strict closed-loop setup, where the self-evolution algorithm has access only to an unlabeled prompt set and a base model, how close can internally generated supervision come to oracle-supervised training? We analyze four representative strategies in a unified offline self-evolution framework: single-round verification, multi-turn revision with feedback, iterative training, and curriculum learning. Our primary experiments use Knights and Knaves (KK) logical reasoning tasks, which provide deterministic solutions, controlled difficulty levels, and a clean testbed for easy-to-hard generalization. We first show that self-evolution consistently improves over the base model but plateaus as additional training compute is invested, ultimately leaving a non-trivial gap to oracle supervision. We find that multi-turn critic-revision with large models can reach strong self-evolution performance, with Gemma 12B nearly matching oracle-supervised training. Beyond Knights and Knaves, we also evaluate self-evolution on real-world reasoning benchmarks, where gains are modest. Overall, our results characterize when closed-loop self-evolution can help and show how internally generated supervision remains insufficient under this minimal formulation.
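
The abstract notes that Knights and Knaves puzzles have deterministic solutions. As a minimal sketch of why that makes them easy to grade exactly, the Python below brute-forces every knight/knave assignment for a small puzzle. The puzzle encoding, the `solve_kk` helper, and the two-person example are all illustrative assumptions, not the authors' data generator or evaluation code.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

# True = knight (always tells the truth), False = knave (always lies).
Assignment = Tuple[bool, ...]
# A statement maps a full role assignment to the truth value of what was said.
Statement = Callable[[Assignment], bool]


def solve_kk(statements: Sequence[Statement]) -> List[Assignment]:
    """Return every role assignment consistent with all statements.

    Hypothetical helper: person i is a knight exactly when statement i is
    true under the assignment, so we check that equality for everyone.
    """
    n = len(statements)
    solutions = []
    for roles in product([True, False], repeat=n):
        if all(roles[i] == statements[i](roles) for i in range(n)):
            solutions.append(roles)
    return solutions


# Illustrative 2-person puzzle (not taken from the paper):
#   A says: "B is a knave."
#   B says: "A and I are both knights."
puzzle = [
    lambda r: not r[1],
    lambda r: r[0] and r[1],
]

solutions = solve_kk(puzzle)
print(solutions)
```

Because the search is exhaustive over 2^n assignments, a well-posed puzzle yields exactly one solution, and a model's answer can be scored by exact match against it. Larger groups, such as the 6–8 person split in the table, only enlarge the search space; the check itself is unchanged.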

Zhenting Qi, Susanna Maria Baby, Stefanie Anna Baby, Kan Yuan, Andrew Tomkins, Tu Vu, Da-Cheng Juan, Cyrus Rashtchian • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH 500	Accuracy 77.4	589
Mathematical Reasoning	TabMWP	Accuracy 92.3	210
Mathematical Reasoning	MATH Hard	Accuracy 55.1	208
Logical Reasoning	KK	Test Accuracy 33.2	28
Logical Reasoning	KK 6–8 ppl.	Accuracy 27.5	21
Logical Reasoning	KK 2–3 ppl.	Accuracy 84.8	21
Logical Reasoning	KK 4–5 ppl.	Accuracy 58.7	21
Logical Reasoning	KK All	Accuracy 52.8	21
