Maximum Likelihood Reinforcement Learning

About

Reinforcement learning is the method of choice to train models in sampling-based setups with binary outcome feedback, such as navigation, code generation, and mathematical problem solving. In such settings, models implicitly induce a likelihood over correct rollouts. However, we observe that reinforcement learning does not maximize this likelihood, and instead optimizes only a lower-order approximation. Inspired by this observation, we introduce Maximum Likelihood Reinforcement Learning (MaxRL), a sampling-based framework to approximate maximum likelihood using reinforcement learning techniques. MaxRL addresses the challenges of non-differentiable sampling by defining a compute-indexed family of sample-based objectives that interpolate between standard reinforcement learning and exact maximum likelihood as additional sampling compute is allocated. The resulting objectives admit a simple, unbiased policy-gradient estimator and converge to maximum likelihood optimization in the infinite-compute limit. Empirically, we show that MaxRL Pareto-dominates existing methods across all models and tasks we tested, achieving up to 20x test-time scaling efficiency gains compared to its GRPO-trained counterpart. We also observe that MaxRL scales better with additional data and compute. Our results suggest MaxRL is a promising framework for scaling RL training in correctness-based settings.
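
The abstract does not state which expansion makes standard RL a "lower-order approximation" of the likelihood. One reading, offered here purely as an assumption, uses the power series log p = -sum_{k>=1} (1-p)^k / k, where p is the probability that a sampled rollout is correct. Truncating at order 1 gives p - 1, which is the expected binary reward up to a constant, and higher orders approach log p. The minimal Python sketch below illustrates only this numerical fact; the function name `truncated_log_likelihood` and the parameter `order` are hypothetical and do not come from the paper.

```python
import math


def truncated_log_likelihood(p: float, order: int) -> float:
    """Order-`order` truncation of log(p) = -sum_{k>=1} (1 - p)**k / k.

    Assumption: `p` is the success probability of a single rollout and
    `order` plays the role of a compute index. Illustrative only.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if order < 1:
        raise ValueError("order must be at least 1")
    q = 1.0 - p  # failure probability of a single rollout
    return -sum(q**k / k for k in range(1, order + 1))


if __name__ == "__main__":
    p = 0.2
    for order in (1, 2, 8, 64, 512):
        approx = truncated_log_likelihood(p, order)
        print(f"order={order:4d}  truncated={approx:.6f}  log(p)={math.log(p):.6f}")
```

At order 1 the truncated objective is linear in p, so its gradient matches that of the expected reward. As the order grows, the gap to log p shrinks, which is consistent with the interpolation the abstract describes.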

Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, Andrea Zanette • 2026

Related benchmarks

Task	Dataset	Metric	Result
Instruction Following	IFEval	IFEval Accuracy	66.12	854
Mathematical Reasoning	Mathematics Benchmarks Average	Pass@1	48.6	44
Mathematical Reasoning	Minerva	Average Score (@32)	48.9	20
Mathematical Reasoning	AMC 23	Avg@32	93.4	20
Mathematical Reasoning	Olympiad	Avg@4	53.6	20
Mathematical Reasoning	AIME 24	Avg@32	48.6	20
Mathematical Reasoning	HMMT 25	Pass@1	7.9	14
Factual Grounding	FACTS grounding v2	Factual Grounding Score (FACTS-v2)	14.19	12
Conversational Alignment	Alpaca-Evals	Alpaca-Evals Score	55.33	12
General Alignment	FACTS-grounding Alpaca-Evals and IFEval Aggregate v2	Mean Score	44.51	12

Showing 10 of 25 rows
