
LAD: Learning Advantage Distribution for Reasoning

About

Current reinforcement learning objectives for large-model reasoning primarily focus on maximizing expected rewards. This paradigm can lead to overfitting to dominant reward signals, while neglecting alternative yet valid reasoning trajectories, thereby limiting diversity and exploration. To address this issue, we introduce Learning Advantage Distribution (LAD), a distribution-matching framework that replaces advantage maximization with learning the advantage-induced distribution. By establishing the equivalence between the optimal policy update and an advantage-based target distribution, we derive a practical LAD objective formulated as minimizing an $f$-divergence between the policy-induced and advantage-induced distributions. This yields a gradient update that increases likelihood for high-advantage responses while suppressing over-confident probability growth, preventing collapse without requiring auxiliary entropy regularization. LAD incurs no extra training cost compared to GRPO and scales naturally to LLM post-training. In a controlled bandit setting, LAD faithfully recovers the multimodal advantage distribution, validating the theoretical formulation. Experiments on math and code reasoning tasks across several LLM backbones show that LAD reliably improves both accuracy and generative diversity.

Wendi Li, Sharon Li • 2026
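
To make the objective described above concrete, here is a minimal sketch of a LAD-style update for a single prompt group. It assumes, as one plausible instantiation rather than the paper's exact recipe, that the advantage-induced target is a softmax over group-relative advantages, $q_i \propto \exp(A_i/\beta)$, and that the $f$-divergence is the forward KL $D_{\mathrm{KL}}(q \,\|\, p_\theta)$ between that target and the group-renormalized policy likelihoods. The temperature `beta`, the group-softmax target, and the function name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def lad_loss_sketch(logprobs: torch.Tensor, advantages: torch.Tensor,
                    beta: float = 1.0) -> torch.Tensor:
    """Hypothetical LAD-style distribution-matching loss for one prompt.

    logprobs:   (G,) sequence log-probabilities log pi_theta(y_i | x)
                for G responses sampled for the same prompt x.
    advantages: (G,) group-relative advantages A_i (e.g., reward
                z-scores within the group, as in GRPO).
    beta:       temperature of the assumed advantage-induced target.
    """
    # Advantage-induced target over the sampled group: q_i ∝ exp(A_i / beta).
    # It is a fixed target, so no gradients flow through it.
    q = F.softmax(advantages.detach() / beta, dim=0)

    # Policy-induced distribution over the same group: p_i ∝ pi_theta(y_i | x),
    # renormalized within the group so p and q live on the same support.
    log_p = F.log_softmax(logprobs, dim=0)

    # Forward KL(q || p) as one concrete f-divergence choice.
    return torch.sum(q * (torch.log(q + 1e-12) - log_p))
```

Under these assumptions, the gradient of the loss with respect to each `logprobs[j]` is $p_j - q_j$: likelihood rises on responses whose target mass exceeds their current policy mass and falls once a response becomes over-represented, mirroring the self-limiting behavior the abstract describes without an auxiliary entropy bonus. Since everything is computed from the same group samples GRPO already draws, the per-step cost is essentially unchanged.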

Related benchmarks

Task                      Dataset                     Metric                           Result   Rank
Code Reasoning            LiveCodeBench               Avg@16                           33.51    6
Code Reasoning            HumanEval+                  Average Score @16                82.29    6
Code Reasoning            CodeForces                  Rating                           1530     6
Math Reasoning            MATH                        Average Success Rate (Avg@32)    77.66    6
Math Reasoning            Olympiad Bench              Avg@32                           41.25    6
Math Reasoning            AIME 2024                   Avg Recall@32                    19.6     6
Math Reasoning            AMC                         Avg@32                           56.45    6
Math Reasoning            Math Reasoning Aggregate    Avg@32                           40.08    6
Mathematical Reasoning    AIME 2024                   Pass@32                          47.19    6
Mathematical Reasoning    AMC                         Pass@32                          88.86    6

(10 of 17 rows shown)
