Maximum Likelihood Reinforcement Learning
About
Reinforcement learning is the method of choice to train models in sampling-based setups with binary outcome feedback, such as navigation, code generation, and mathematical problem solving. In such settings, models implicitly induce a likelihood over correct rollouts. However, we observe that reinforcement learning does not maximize this likelihood, and instead optimizes only a lower-order approximation of it. Inspired by this observation, we introduce Maximum Likelihood Reinforcement Learning (MaxRL), a sampling-based framework that approximates maximum likelihood using reinforcement learning techniques. MaxRL addresses the challenges of non-differentiable sampling by defining a compute-indexed family of sample-based objectives that interpolate between standard reinforcement learning and exact maximum likelihood as additional sampling compute is allocated. The resulting objectives admit a simple, unbiased policy-gradient estimator and converge to maximum likelihood optimization in the infinite-compute limit. Empirically, we show that MaxRL Pareto-dominates existing methods across all models and tasks we tested, achieving up to 20x test-time scaling efficiency gains compared to its GRPO-trained counterpart. We also observe that MaxRL scales better with additional data and compute. Our results suggest MaxRL is a promising framework for scaling RL training in correctness-based settings.
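One way to see how expected reward can act as a lower-order approximation of the likelihood is the sketch below. It assumes the likelihood of interest is log p, where p is the probability that a sampled rollout is correct, and it uses the standard expansion log p = -Σ_{k≥1} (1-p)^k / k. Truncating at order 1 gives p - 1, which is the expected binary reward up to a constant; higher orders move toward log p. The function names and the truncation order `order` are hypothetical and chosen for illustration; this is not claimed to be the exact family of objectives defined in the paper.

```python
import math


def truncated_log_likelihood(p: float, order: int) -> float:
    """Order-`order` truncation of log(p) expanded around p = 1.

    Uses log(p) = -sum_{k>=1} (1 - p)^k / k, keeping the first `order` terms.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if order < 1:
        raise ValueError("order must be at least 1")
    q = 1.0 - p  # probability that a single rollout fails
    return -sum(q**k / k for k in range(1, order + 1))


def truncated_gradient_weight(p: float, order: int) -> float:
    """Derivative of the truncated objective with respect to p.

    Each term -(1 - p)^k / k differentiates to (1 - p)^(k - 1), so the
    weight is 1 at order 1 (expected reward) and tends to 1/p (log-likelihood)
    as the order grows.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if order < 1:
        raise ValueError("order must be at least 1")
    q = 1.0 - p
    return sum(q ** (k - 1) for k in range(1, order + 1))


if __name__ == "__main__":
    p = 0.25  # hypothetical success probability on a hard prompt
    for order in (1, 2, 8, 64):
        value = truncated_log_likelihood(p, order)
        weight = truncated_gradient_weight(p, order)
        print(f"order={order:3d}  objective={value:.4f}  dp-weight={weight:.4f}")
    print(f"exact log p = {math.log(p):.4f}, exact 1/p = {1 / p:.4f}")
```

Under these assumptions, the gradient weight on a low-p prompt grows from 1 toward 1/p as the order increases, which illustrates why a maximum-likelihood-style objective puts more emphasis on hard prompts than plain expected reward does.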
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction Following | IFEval | IFEval Accuracy | 66.12 | 836 |
| Mathematical Reasoning | Mathematics Benchmarks Average | Pass@1 | 48.6 | 44 |
| Mathematical Reasoning | Minerva | Average Score (@32) | 48.9 | 20 |
| Mathematical Reasoning | AMC 23 | Avg@32 | 93.4 | 20 |
| Mathematical Reasoning | Olympiad | Avg@4 | 53.6 | 20 |
| Mathematical Reasoning | AIME 24 | Avg@32 | 48.6 | 20 |
| Mathematical Reasoning | HMMT 25 | Pass@1 | 7.9 | 14 |
| Factual Grounding | FACTS grounding v2 | Factual Grounding Score (FACTS-v2) | 14.19 | 12 |
| Conversational Alignment | Alpaca-Evals | Alpaca-Evals Score | 55.33 | 12 |
| General Alignment | FACTS-grounding Alpaca-Evals and IFEval Aggregate v2 | Mean Score | 44.51 | 12 |