Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
About
Self-play has recently emerged as a promising paradigm for post-training Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., a question), which it then addresses itself by producing a task output (e.g., an answer). A reward model evaluates the output, and the rewards are used to train the LLM, typically via Reinforcement Learning (RL). A key benefit of self-play for post-training LLMs is its minimal supervision cost: self-play avoids the need for high-quality input-output pairs traditionally constructed by humans or expensive proprietary models. Existing work, however, explores self-play only for verifiable tasks, such as math and coding, for which objective ground truth is available and easily checkable. In this paper, we seek to extend self-play to more realistic open-ended tasks. We propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics along with each input-output pair. The rubric is used to evaluate outputs and train the model. Crucially, we ground the framework in a content-rich pretraining corpus to (1) enable an exploitable generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP improves the performance of both the pretrained base model and the instruction-tuned model on multiple tasks ranging from long-form healthcare QA to creative writing and instruction following.
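The loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Episode`, `rubric_reward`, and `self_play_episode`, the prompt wording, and the choice to score an output as the fraction of rubric criteria it satisfies are all assumptions made for illustration. The sketch also assumes that the task and rubric are written with access to the pretraining passage while the answer is not, as one possible way to create a generation-verification gap. The `generate` and `judge` callables stand in for the same LLM used in different roles, and the RL update that would consume the reward is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Episode:
    task_input: str
    rubric: List[str]
    output: str
    reward: float


def rubric_reward(rubric: List[str], output: str,
                  judge: Callable[[str, str], bool]) -> float:
    """Fraction of rubric criteria the output satisfies (assumed scoring rule)."""
    if not rubric:
        return 0.0
    passed = sum(1 for criterion in rubric if judge(criterion, output))
    return passed / len(rubric)


def self_play_episode(passage: str,
                      generate: Callable[[str], str],
                      judge: Callable[[str, str], bool]) -> Episode:
    """One self-play episode grounded in a pretraining passage."""
    # 1. The model proposes a task grounded in the passage.
    task_input = generate(f"TASK: write one open-ended task grounded in this passage.\n{passage}")
    # 2. The same model writes a rubric, one criterion per line, with passage access.
    rubric_text = generate(f"RUBRIC: list grading criteria for the task.\n{task_input}\n{passage}")
    rubric = [line.strip() for line in rubric_text.splitlines() if line.strip()]
    # 3. The same model answers the task without seeing the passage (assumption).
    output = generate(f"ANSWER: {task_input}")
    # 4. The rubric turns the open-ended output into a scalar reward for RL.
    reward = rubric_reward(rubric, output, judge)
    return Episode(task_input, rubric, output, reward)


# Toy stand-ins so the sketch runs without a model; both are hypothetical.
def toy_generate(prompt: str) -> str:
    if prompt.startswith("TASK"):
        return "Explain why hand washing reduces infection."
    if prompt.startswith("RUBRIC"):
        return "mentions germs\nmentions soap\nmentions water\n"
    return "Soap removes germs from the skin."


def toy_judge(criterion: str, output: str) -> bool:
    # Keyword match on the last word of the criterion; a real judge would be an LLM call.
    return criterion.split()[-1] in output.lower()


if __name__ == "__main__":
    episode = self_play_episode("Hand washing with soap and water removes germs.",
                                toy_generate, toy_judge)
    print(episode.rubric, round(episode.reward, 3))
```

In a real setup, the rewards from many such episodes would feed an RL algorithm that updates the model; the toy judge here only checks keywords and is not a substitute for rubric-based grading by an LLM.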
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Accuracy (Acc) | 75.75 | 543 |
| Mathematical Reasoning | AIME 2024 | Accuracy | 12.03 | 479 |
| Mathematical Reasoning | GSM8K | Accuracy (Acc) | 90.24 | 337 |
| Mathematical Reasoning | AIME 2025 | Accuracy | 9.84 | 311 |
| Instruction Following | Arena Hard | Win Rate | 54.09 | 263 |
| Multiple-choice Question Answering | MMLU-Pro | MMLU-Pro Overall Accuracy | 56.62 | 130 |
| Medical Question Answering | MedQA | Accuracy | 56.54 | 124 |
| Question Answering | MedQA | Accuracy | 56.47 | 86 |
| Question Answering | GPQA Diamond | Accuracy | 38.38 | 61 |
| Language Understanding | MMLU-Pro | MMLU-Pro Accuracy | 56.44 | 60 |