
Maximum Likelihood Reinforcement Learning

About

Reinforcement learning is the method of choice for training models in sampling-based setups with binary outcome feedback, such as navigation, code generation, and mathematical problem solving. In such settings, models implicitly induce a likelihood over correct rollouts. However, we observe that reinforcement learning does not maximize this likelihood and instead optimizes only a lower-order approximation of it. Inspired by this observation, we introduce Maximum Likelihood Reinforcement Learning (MaxRL), a sampling-based framework that approximates maximum likelihood using reinforcement learning techniques. MaxRL addresses the challenges of non-differentiable sampling by defining a compute-indexed family of sample-based objectives that interpolate between standard reinforcement learning and exact maximum likelihood as additional sampling compute is allocated. The resulting objectives admit a simple, unbiased policy-gradient estimator and converge to maximum likelihood optimization in the infinite-compute limit. Empirically, we show that MaxRL Pareto-dominates existing methods across all models and tasks we tested, achieving up to 20x test-time scaling efficiency gains compared to its GRPO-trained counterpart. We also observe that MaxRL scales better with additional data and compute. Our results suggest that MaxRL is a promising framework for scaling RL training in correctness-based settings.
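One way to make the "lower-order approximation" claim concrete is the series identity log p = -sum_{k>=1} (1 - p)^k / k for a success probability p in (0, 1]. This is an illustration under our own assumptions, not the derivation given in the abstract. Truncating at first order gives p - 1, which has the same gradient as the expected binary reward p, while higher orders approach log p. The sketch below only evaluates the truncated series numerically; the function name `truncated_log_likelihood` and its `order` parameter are hypothetical and do not come from the paper.

```python
import math


def truncated_log_likelihood(p: float, order: int) -> float:
    """Order-`order` truncation of log p = -sum_{k>=1} (1 - p)^k / k.

    Hypothetical helper for illustration only: `p` is the probability that
    a sampled rollout is correct, `order` is the number of series terms kept.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if order < 1:
        raise ValueError("order must be at least 1")
    q = 1.0 - p  # failure probability
    return -sum(q**k / k for k in range(1, order + 1))


if __name__ == "__main__":
    p = 0.2
    # Order 1 equals p - 1 (expected reward up to a constant);
    # larger orders move toward log p.
    for order in (1, 2, 8, 64, 512):
        approx = truncated_log_likelihood(p, order)
        print(f"order={order:4d}  truncated={approx:.6f}  log p={math.log(p):.6f}")
```

The gap between the truncated value and log p is largest when p is small, which is the regime where a first-order objective and a likelihood objective differ most.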

Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, Andrea Zanette • 2026

Related benchmarks

Task | Dataset | Result | Rank
Instruction Following | IFEval | IFEval Accuracy: 66.12 | 836
Mathematical Reasoning | Mathematics Benchmarks Average | Pass@1: 48.6 | 44
Mathematical Reasoning | Minerva | Average Score (@32): 48.9 | 20
Mathematical Reasoning | AMC 23 | Avg@32: 93.4 | 20
Mathematical Reasoning | Olympiad | Avg@4: 53.6 | 20
Mathematical Reasoning | AIME 24 | Avg@32: 48.6 | 20
Mathematical Reasoning | HMMT 25 | Pass@1: 7.9 | 14
Factual Grounding | FACTS grounding v2 | Factual Grounding Score (FACTS-v2): 14.19 | 12
Conversational Alignment | Alpaca-Evals | Alpaca-Evals Score: 55.33 | 12
General Alignment | FACTS-grounding Alpaca-Evals and IFEval Aggregate v2 | Mean Score: 44.51 | 12
Showing 10 of 25 rows
