Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

About

Self-play has recently emerged as a promising paradigm for post-training Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., a question), which it then addresses itself by producing a task output (e.g., an answer). A reward model evaluates the output, and the rewards are used to train the LLM, typically via Reinforcement Learning (RL). A key benefit of self-play for post-training LLMs is its minimal supervision cost: self-play avoids the need for high-quality input-output pairs traditionally constructed by humans or expensive proprietary models. Existing work, however, explores self-play only for verifiable tasks, such as math and coding, for which objective ground truth is available and easily checkable. In this paper, we seek to extend self-play to more realistic open-ended tasks. We propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics along with each input-output pair. The rubric is used to evaluate outputs and train the model. Crucially, we ground the framework in a content-rich pretraining corpus to (1) enable an exploitable generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP improves the performance of both the pretrained base model and the instruction-tuned model on multiple tasks ranging from long-form healthcare QA to creative writing and instruction following.
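The loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `self_play_step`, `generate`, `judge`, and `Episode`, the prompt wording, and the "fraction of rubric criteria satisfied" reward are all assumptions made for illustration. In particular, the answer prompt here omits the grounding passage; that is one guess at how a generation-verification gap could arise, not a detail stated in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Episode:
    task: str
    rubric: List[str]
    output: str
    reward: float


def self_play_step(
    generate: Callable[[str], str],     # hypothetical: prompt -> model text
    judge: Callable[[str, str], bool],  # hypothetical: (criterion, output) -> satisfied?
    passage: str,
) -> Episode:
    """One illustrative self-play episode grounded in a pretraining passage."""
    # 1. The model writes an open-ended task grounded in the passage.
    task = generate(f"Write one open-ended task grounded in this passage:\n{passage}")

    # 2. The same model writes rubric criteria, one per line, with passage access.
    rubric_text = generate(
        f"List rubric criteria, one per line, for the task:\n{task}\nPassage:\n{passage}"
    )
    rubric = [line.strip() for line in rubric_text.splitlines() if line.strip()]

    # 3. The same model answers the task (assumption: without seeing the passage).
    output = generate(f"Answer the task:\n{task}")

    # 4. Reward = fraction of rubric criteria judged satisfied (assumed scoring rule).
    satisfied = sum(1 for criterion in rubric if judge(criterion, output))
    reward = satisfied / len(rubric) if rubric else 0.0
    return Episode(task=task, rubric=rubric, output=output, reward=reward)
```

In an actual training setup, the `reward` from each episode would feed an RL update of the model behind `generate`; that step is left out because the abstract does not specify the algorithm.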

Chengyu Huang, Sheng-Yen Chou, Zhengxin Zhang, Claire Cardie • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH 500	Accuracy (Acc) 75.75	600
Mathematical Reasoning	AIME 2024	Accuracy 12.03	525
Mathematical Reasoning	AIME 2025	Accuracy 9.84	353
Mathematical Reasoning	GSM8K	Accuracy (Acc) 90.24	352
Instruction Following	Arena Hard	Win Rate 54.09	263
Medical Question Answering	MedQA	Accuracy 56.54	145
Multiple-choice Question Answering	MMLU-Pro	MMLU-Pro Overall Accuracy 56.62	138
Question Answering	MedQA	Accuracy 56.47	86
Question Answering	GPQA Diamond	Accuracy 38.38	65
Language Understanding	MMLU-Pro	MMLU-Pro Accuracy 56.44	60

Showing 10 of 18 rows
