Beyond Verifiable Rewards: Scaling Reinforcement Learning for Language Models to Unverifiable Data
About
We propose JEPO (Jensen's Evidence lower bound Policy Optimization), a novel algorithm for scaling RL to unverifiable data. While most prior efforts on scaling RL for LLMs focus on verifiable data, where ground-truth answers are typically short-form and can be matched easily, we investigate the case where such assumptions are less valid (e.g., when answers are long-form, such as mathematical proofs). To scale RL training to unverifiable data under contemporary training constraints, JEPO applies Jensen's evidence lower bound, a pragmatic simplification of the evidence lower bound that views chain-of-thought as a latent variable in the generative process. We show that on verifiable data (math), JEPO is as effective as RL with verifiable rewards; on semi-verifiable data (Numina), JEPO improves on soft-match-based evaluations compared to RL with verifiable rewards, which can only leverage a subset of the data source; and on unverifiable data (Numina-Proof), JEPO outperforms SFT and several ablation baselines on likelihood evaluations.
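The bound behind JEPO is Jensen's inequality applied to the evidence: with chain-of-thought z treated as a latent variable, log p(y|x) = log E_{z~π}[p(y|x,z)] ≥ E_{z~π}[log p(y|x,z)], so the average log-likelihood of the answer over sampled chains of thought lower-bounds the log-evidence. A minimal numerical sketch of this relationship (all function names and the toy log-likelihoods are illustrative, not the paper's implementation):

```python
import math

def jensen_elbo(log_probs):
    """Jensen's evidence lower bound: E_z[log p(y|x,z)], estimated as the
    mean of per-sample answer log-likelihoods for sampled chains of thought z."""
    return sum(log_probs) / len(log_probs)

def log_marginal_mc(log_probs):
    """Monte-Carlo estimate of the log-evidence log E_z[p(y|x,z)],
    computed as a numerically stable log-mean-exp."""
    m = max(log_probs)
    return m + math.log(sum(math.exp(lp - m) for lp in log_probs) / len(log_probs))

# Toy per-sample values of log p(y | x, z_i) for one (x, y) pair.
lps = [-1.2, -0.7, -2.3]
assert jensen_elbo(lps) <= log_marginal_mc(lps)  # Jensen's inequality holds
```

Optimizing the left-hand side avoids estimating the intractable marginal directly, which is what makes the objective practical under contemporary RL training constraints.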
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Reasoning | BBH (test) | Accuracy | 40.2 | 40 |
| General Reasoning | MMLU-Pro | pass@1 Accuracy | 42.7 | 27 |
| Long-form reasoning | NuminaProof (test) | Avg LogProb (per answer) | -1.0218 | 14 |
| Long-form reasoning | Alpaca | Avg LogProb (per answer) | -0.9443 | 14 |
| General Reasoning | GPQA | Average@4 | 31.6 | 7 |
| Mathematical Reasoning | CARP-EN | Average@2 | 0.634 | 7 |
| Mathematical Reasoning | SAT Math | Average@32 | 93.7 | 7 |
| Mathematical Reasoning | AIME 2024 | Average@32 | 4.8 | 7 |
| Mathematical Reasoning | Minerva | Average@4 | 23.6 | 7 |
| General Reasoning | TheoremQA | Average@2 | 31.9 | 7 |