Language Model Self-improvement by Reinforcement Learning Contemplation

About

Large Language Models (LLMs) have exhibited remarkable performance across various natural language processing (NLP) tasks. However, fine-tuning these models often necessitates substantial supervision, which can be expensive and time-consuming to obtain. This paper introduces a novel unsupervised method called Language Model Self-Improvement by Reinforcement Learning Contemplation (SIRLC) that improves LLMs without reliance on external labels. Our approach is grounded in the observation that it is simpler for language models to assess text quality than to generate text. Building on this insight, SIRLC assigns LLMs dual roles as both student and teacher. As a student, the LLM generates answers to unlabeled questions, while as a teacher, it evaluates the generated text and assigns scores accordingly. The model parameters are updated using reinforcement learning to maximize the evaluation score. We demonstrate that SIRLC can be applied to various NLP tasks, such as reasoning problems, text generation, and machine translation. Our experiments show that SIRLC effectively improves LLM performance without external supervision, resulting in a 5.6% increase in answering accuracy for reasoning tasks and a rise in BERTScore from 0.82 to 0.86 for translation tasks. Furthermore, SIRLC can be applied to models of different sizes, showcasing its broad applicability.
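
The student/teacher loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "model" here is just a softmax over ten candidate answers, the teacher is a hypothetical stand-in function (`teacher_score`) that checks an answer rather than prompting an LLM, and the update is plain REINFORCE with a moving-average baseline, chosen for brevity. All names and hyperparameters are assumptions made for illustration. The task (find x with x * x == 49) is picked because checking an answer is easier than producing one, which mirrors the observation the method builds on.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_score(target_square, answer):
    """Hypothetical teacher role: score an answer by verifying it.

    In the method described above, the same LLM would be prompted to
    evaluate its own output; this fixed check is only a stand-in.
    """
    return 1.0 if answer * answer == target_square else 0.0


def train(target_square=49, num_candidates=10, steps=500, lr=0.5, seed=0):
    """Toy self-improvement loop: sample, self-evaluate, policy-gradient update."""
    rng = random.Random(seed)
    logits = [0.0] * num_candidates  # the toy "model parameters"
    baseline = 0.0                   # moving average of past rewards

    for _ in range(steps):
        probs = softmax(logits)

        # Student role: generate (sample) an answer from the current policy.
        answer = rng.choices(range(num_candidates), weights=probs)[0]

        # Teacher role: evaluate the generated answer; no external label is used.
        reward = teacher_score(target_square, answer)

        # REINFORCE: d log p(answer) / d logit_i = 1[i == answer] - p_i
        advantage = reward - baseline
        for i in range(num_candidates):
            grad = (1.0 if i == answer else 0.0) - probs[i]
            logits[i] += lr * advantage * grad

        baseline = 0.9 * baseline + 0.1 * reward

    return softmax(logits)


if __name__ == "__main__":
    final_probs = train()
    best = max(range(len(final_probs)), key=final_probs.__getitem__)
    print("most likely answer:", best)
```

In this sketch the reward comes only from the evaluation step, so probability mass shifts toward answers the teacher scores highly. A real setup would replace the softmax table with an LLM and would likely use a more robust RL algorithm than bare REINFORCE.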

Jing-Cheng Pang, Pengyuan Wang, Kaiyuan Li, Xiong-Hui Chen, Jiacheng Xu, Zongzhang Zhang, Yang Yu • 2023

Related benchmarks

Task	Dataset	Result
Instruction Following	IFEval	--	854
Math Reasoning	GSM8K	Pass@4 Accuracy 95.38	54
Math Reasoning	MATH 500	Pass@4 72.3	39
Multi-task Knowledge	MMLU-Pro	MMLU-Pro Score 0.4867	33
Math Reasoning	AMC	Avg@8 Accuracy 41.11	27
Code Generation	LiveCodeBench	Avg@5 Accuracy 14.33	27
Math Reasoning	AIME 2025	Accuracy (avg@16) 5.62	27
Code	CRUX	Accuracy @5 51.9	27
Math Reasoning	AIME 2024	Accuracy (avg@16) 10.42	27
Math Reasoning	MATH 500	Success Rate (pass@4) 82.2	27

