
Coupled Variational Reinforcement Learning for Language Model General Reasoning

About

While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing the probabilities that LLMs generate reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose Coupled Variational Reinforcement Learning (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.
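The two ideas the abstract describes can be sketched in toy form: a verifier-free reward computed as the log-probability the model assigns to the reference answer given the question and a sampled trace, and a hybrid sampling step that draws traces either from the question-only prior or from an answer-conditioned posterior. The sketch below is an illustrative assumption, not the paper's implementation; `toy_lm`, `verifier_free_reward`, and `hybrid_sample` are hypothetical names, and the toy unigram "model" stands in for a real LLM.

```python
import math
import random

def toy_lm(context):
    """Toy stand-in for an LLM's next-token distribution.

    A real system would query an actual language model; this deterministic
    stub just upweights vocabulary tokens that already appear in the context.
    """
    vocab = ["4", "2", "so", "the", "answer", "is"]
    weights = [2.0 if tok in context else 1.0 for tok in vocab]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(vocab, weights)}

def verifier_free_reward(question, trace, reference_answer):
    """Verifier-free reward signal (sketch): mean log-probability the model
    assigns to the reference-answer tokens, conditioned on the question and
    the sampled reasoning trace."""
    context = question + " " + trace
    logps = []
    for tok in reference_answer.split():
        dist = toy_lm(context)
        logps.append(math.log(dist.get(tok, 1e-9)))
        context += " " + tok  # teacher-force the reference answer
    return sum(logps) / len(logps)

def hybrid_sample(question, reference_answer, mix=0.5, n_tokens=4, rng=None):
    """Hybrid sampling (sketch of the coupling idea): with probability `mix`,
    draw the trace from an answer-conditioned posterior q(trace | question,
    answer); otherwise from the prior p(trace | question)."""
    rng = rng or random.Random(0)
    context = question + (" " + reference_answer if rng.random() < mix else "")
    trace = []
    for _ in range(n_tokens):
        dist = toy_lm(context + " " + " ".join(trace))
        tokens, probs = zip(*dist.items())
        trace.append(rng.choices(tokens, weights=probs)[0])
    return " ".join(trace)
```

In this toy setting, a trace that already surfaces the correct answer earns a higher reward than one that does not, which is the coherence property the hybrid prior/posterior sampling is meant to encourage.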

Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Yaojie Lu, Debing Zhang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| General Reasoning | MMLU-Pro | pass@1 Accuracy | 46.5 | 27 |
| General Reasoning | TheoremQA | Average@2 | 36.3 | 7 |
| Mathematical Reasoning | AIME 2024 | Average@3 | 27.5 | 7 |
| Mathematical Reasoning | CARP-EN | Average@2 | 0.651 | 7 |
| Mathematical Reasoning | MATH 500 | Average@4 | 66.3 | 7 |
| Mathematical Reasoning | Minerva | Average@4 | 25.5 | 7 |
| Mathematical Reasoning | SAT Math | Average@32 | 97.1 | 7 |
| General Reasoning | GPQA | Average@4 | 30.4 | 7 |
