GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero
About
Post-training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL-based post-training has increasingly split into two paradigms: reinforcement learning from human feedback (RLHF), which optimizes models using human preference signals in target domains, and reinforcement learning from verifiable rewards (RLVR), which operates in verifier-backed environments. The latter has dominated recent reasoning-oriented post-training because it delivers stronger gains and higher efficiency on domain-specific tasks (e.g., reasoning). However, although in-domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. In this work, we present GRLO, in which we study the generalization ability of RLHF learned from scratch from a small set of interactions in open-ended environments, and investigate whether the conversational abilities it explicitly acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation. Specifically, on the Qwen3-4B-Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about 46× less data and 68× less compute than a strong in-domain RLVR baseline. The resulting model is even competitive with Qwen's released post-trained models, which incurred a much higher training cost. Notably, a subsequent in-domain RLVR stage brings only selective gains, mainly on harder competition-math benchmarks. We hope GRLO offers a simple and efficient recipe for building broadly capable post-trained models. Our code and data will be available at: https://github.com/SJY8460/GRLO.
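To make the efficiency claim concrete, the short Python sketch below works through the arithmetic implied by the figures quoted in the abstract. The baseline budget it prints is not reported here: it is only what the approximate 46× and 68× ratios would imply, and all variable names are illustrative.

```python
# Figures quoted in the abstract (Qwen3-4B-Base backbone).
grlo_prompts = 5_000        # "only 5K prompts"
grlo_gpu_hours = 22.7
data_ratio = 46             # "about 46x less data" (approximate)
compute_ratio = 68          # "about 68x less compute" (approximate)
base_avg = 24.1             # average score before GRLO
grlo_avg = 63.1             # average score after GRLO

# Baseline budget implied by the ratios (an estimate, not a reported number).
implied_baseline_prompts = grlo_prompts * data_ratio
implied_baseline_gpu_hours = grlo_gpu_hours * compute_ratio

# Size of the improvement in average score.
absolute_gain = grlo_avg - base_avg
relative_gain = grlo_avg / base_avg

print(f"Implied RLVR baseline data:    ~{implied_baseline_prompts:,} prompts")
print(f"Implied RLVR baseline compute: ~{round(implied_baseline_gpu_hours, 1)} GPU hours")
print(f"Average score gain: +{round(absolute_gain, 1)} points ({relative_gain:.2f}x)")
```

Under these assumptions the RLVR baseline would sit at roughly 230K prompts and on the order of 1,500 GPU hours. Because the ratios are stated as approximate, these are order-of-magnitude estimates only.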
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Instruction Following | AlpacaEval LC 2 | Win Rate | 51.1 | 49 |
| Reasoning | GPQA | Accuracy | 17.7 | 37 |
| Chat | AlpacaEval LC 2 | LC Win Rate | 35.7 | 16 |
| Scientific Reasoning | GPQA | Accuracy | 48 | 6 |
| General chat | AE2 LC | Win Rate | 57.8 | 6 |
| Graduate-level Science QA | GPQA | Accuracy | 29.3 | 6 |
| General Multitask Evaluation | Aggregated Benchmarks (Math500, GPQA, HumanEval, MBPP, AE2 LC) | Average Score | 40.7 | 5 |
| Mathematical Reasoning | Minerva Math | Score | 42.6 | 4 |
| Mathematical Reasoning | OlympiadBench, AIME24, AIME25, Minerva (average) | Average Score | 27.7 | 4 |
| Mathematical Reasoning | OlympiadBench | Score | 43.4 | 4 |