
Beyond Verifiable Rewards: Scaling Reinforcement Learning for Language Models to Unverifiable Data

About

We propose to scale RL to unverifiable data with a novel algorithm, JEPO (Jensen's Evidence lower bound Policy Optimization). Most prior efforts on scaling RL for LLMs focus on verifiable data, where ground-truth answers are typically short-form and can be matched easily; we investigate the case where such assumptions are less valid (e.g., when answers are long-form, such as mathematical proofs). To scale RL training to unverifiable data under contemporary training constraints, we propose JEPO. JEPO applies Jensen's evidence lower bound, a pragmatic simplification of the evidence lower bound that views chain-of-thought as a latent variable in the generative process. We show that on verifiable data (math), JEPO is as effective as RL with verifiable rewards; on semi-verifiable data (Numina), JEPO improves soft-match-based evaluations compared to RL with verifiable rewards, which can only leverage a subset of the data source; finally, on unverifiable data (Numina-proof), JEPO outperforms SFT and several ablation baselines on likelihood evaluations.
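For intuition, here is a minimal sketch of the bound the name points to, written in our own notation rather than the paper's: treat the chain-of-thought z as a latent variable between the prompt x and the answer y. Jensen's inequality then lower-bounds the answer log-likelihood, and the standard score-function identity splits the bound's gradient into two interpretable terms:

```latex
% Sketch (our notation, not necessarily the paper's): chain-of-thought z
% as a latent variable between prompt x and answer y. Jensen's inequality
% gives an evidence lower bound on the answer log-likelihood:
\log p_\theta(y \mid x)
  = \log \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
      \big[ p_\theta(y \mid x, z) \big]
  \;\ge\;
  \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
      \big[ \log p_\theta(y \mid x, z) \big]
  \;=:\; \mathcal{J}(\theta).

% Score-function (REINFORCE) identity: the gradient of the bound splits
% into a policy-gradient term with reward log p_\theta(y | x, z) and a
% direct, SFT-like term on the answer given the sampled chain-of-thought.
\nabla_\theta \mathcal{J}(\theta)
  = \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
      \big[ \log p_\theta(y \mid x, z)\, \nabla_\theta \log \pi_\theta(z \mid x) \big]
  + \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
      \big[ \nabla_\theta \log p_\theta(y \mid x, z) \big].
```

The first gradient term is an ordinary policy-gradient update whose per-sample reward is the log-likelihood of the reference answer given the sampled chain-of-thought, a signal that remains available when answers are too long-form to match exactly; the second term resembles a supervised fine-tuning update on the answer conditioned on the sampled chain-of-thought.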

Yunhao Tang, Sid Wang, Lovish Madaan, Rémi Munos • 2025

Related benchmarks

Task                   | Dataset            | Result                            | Rank
-----------------------|--------------------|-----------------------------------|-----
Reasoning              | BBH (test)         | Accuracy: 40.2                    | 40
General Reasoning      | MMLU-Pro           | pass@1 Accuracy: 42.7             | 27
Long-form Reasoning    | NuminaProof (test) | Avg LogProb (per answer): -1.0218 | 14
Long-form Reasoning    | Alpaca             | Avg LogProb (per answer): -0.9443 | 14
General Reasoning      | GPQA               | Average@4: 31.6                   | 7
Mathematical Reasoning | CARP-EN            | Average@2: 0.634                  | 7
Mathematical Reasoning | SAT Math           | Average@32: 93.7                  | 7
Mathematical Reasoning | AIME 2024          | Average@32: 4.8                   | 7
Mathematical Reasoning | Minerva            | Average@4: 23.6                   | 7
General Reasoning      | TheoremQA          | Average@2: 31.9                   | 7

Showing 10 of 11 rows.
