Language Models that Think, Chat Better
About
Reinforcement learning with verifiable rewards (RLVR) improves language model reasoning by using rule-based rewards in verifiable domains such as mathematics and code. However, RLVR generalizes poorly to open-ended tasks -- such as outlining essays or making meal plans -- where humans reason routinely. This paper shows that the RLVR paradigm is effective beyond verifiable domains, and introduces **RL** with **M**odel-rewarded **T**hinking (**RLMT**) for general-purpose chat capabilities. Using diverse real-world prompts, RLMT requires LMs to generate long chain-of-thought (CoT) reasoning before responding, and optimizes them with online RL against a preference-based reward model of the kind used in RLHF. Across 40 training runs on Llama-3.1-8B and Qwen-2.5-7B (both base and instruct) and multiple optimization algorithms (DPO, PPO, and GRPO), RLMT consistently outperforms standard RLHF pipelines. This includes substantial gains of 3-7 points on three chat benchmarks (AlpacaEval2, WildBench, and ArenaHardV2), along with 1-3 point improvements on other tasks such as creative writing and general knowledge. Our best 8B model surpasses GPT-4o in chat and creative writing and rivals Claude-3.7-Sonnet (Thinking). RLMT can also be applied directly to base models without an SFT stage, akin to R1-Zero training. Remarkably, with only 7K prompts, Llama-3.1-8B base trained with our RLMT recipe outperforms Llama-3.1-8B-Instruct, which was post-trained with a complex multi-stage pipeline on 25M+ examples. We close with qualitative and quantitative analyses of how trained models plan their responses. Our results call for rethinking the post-training pipeline and invite future work to understand and employ thinking more broadly.
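To make the training signal described above more concrete, the following is a minimal sketch of one scoring step with a GRPO-style group baseline. It is not the paper's implementation. The `<think>` tag format, the helper names, the choice to score only the final response, and the toy reward function are all assumptions made for illustration; in practice a learned preference reward model would replace `toy_reward_model`.

```python
import re
from statistics import mean, pstdev

# Hypothetical output format: "<think>...</think> final response".
THINK_RE = re.compile(r"<think>(.*?)</think>\s*(.*)", re.DOTALL)


def split_thinking(output: str) -> tuple[str, str]:
    """Split a rollout into (thinking, response).

    If the assumed tags are missing, the whole output is treated as the response.
    """
    text = output.strip()
    match = THINK_RE.fullmatch(text)
    if match is None:
        return "", text
    return match.group(1).strip(), match.group(2).strip()


def toy_reward_model(prompt: str, response: str) -> float:
    """Stand-in for a preference reward model (assumption).

    Counts distinct words so the example runs without any model weights.
    """
    return float(len(set(response.lower().split())))


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def score_group(prompt: str, rollouts: list[str]) -> list[float]:
    """Score several sampled rollouts for one prompt.

    Assumption: only the final response (not the thinking) is scored.
    """
    rewards = [toy_reward_model(prompt, split_thinking(o)[1]) for o in rollouts]
    return group_advantages(rewards)


if __name__ == "__main__":
    samples = [
        "<think>list meals, then balance them</think> oats, salad, lentil soup",
        "<think>keep it short</think> soup soup soup",
    ]
    print(score_group("Plan a day of meals.", samples))
```

Rollouts whose responses score above the group mean receive positive advantages and are reinforced; the thinking tokens are credited only indirectly, through the response they lead to.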
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Instruction Following | IFEval | IFEval Accuracy | 86.32 | 836 |
| Mathematical Reasoning | Minerva | Pass@1 Accuracy | 34.56 | 289 |
| Mathematical Reasoning | AIME 2024 | Pass@1 Accuracy | 20.94 | 236 |
| Mathematical Reasoning | GSM8K | -- | -- | 204 |
| Mathematical Reasoning | AIME 2025 | Pass@1 Accuracy | 16.8 | 192 |
| Mathematical Reasoning | AMC | Pass@1 Accuracy | 63.95 | 119 |
| Scientific Reasoning | GPQA Diamond | Score | 43.94 | 68 |
| Mathematical Reasoning | MATH 500 | Pass@1 Accuracy | 82.6 | 59 |
| Dialogue | MT-Bench | MT-Bench Score | 7.812 | 41 |
| Mathematical Reasoning | Olympiad Bench | Accuracy Pass@1 | 44.15 | 27 |