Learning Reasoning Rewards from Expert Demonstrations with Inverse Reinforcement Learning
About
Teaching large language models (LLMs) to reason during post-training typically relies on reinforcement learning with explicit outcome- or process-based reward functions. However, in many real-world settings, obtaining or defining such reward functions is difficult, especially for complex tasks, making learning from expert demonstrations an attractive alternative. The dominant approach, supervised fine-tuning (SFT), trains models to imitate expert reasoning traces directly, but suffers from the general limitations of off-policy learning: performance can be fragile when inference-time behaviour deviates from the states explicitly covered by the demonstrations. To address this, we propose Reasoning Adversarial Inverse Reinforcement Learning (R-AIRL). Rather than imitating the expert's reasoning, R-AIRL infers the underlying process-level reward from the expert chain-of-thought traces. Through experiments on GSM8K, MMLU-Pro, and MedReason, we show that the reasoning reward function learned with R-AIRL can be used effectively throughout the training and inference pipeline: (1) to provide a training signal for post-training, outperforming SFT in most of the considered settings, (2) for inference-time reranking, improving pass@1 by up to 17.4 points, and (3) for process-level evaluation, localising reasoning failures with up to 86.1% accuracy. Overall, R-AIRL bridges imitation learning and reward-based optimisation, enabling the extraction of meaningful reasoning signals from expert thinking traces.
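To make uses (2) and (3) concrete, the following is a minimal sketch of how a step-level reward could drive best-of-N reranking and failure localisation. It is not the authors' implementation: the names `trace_score`, `rerank_best_of_n`, `localise_failure`, and `toy_reward` are hypothetical, the mean aggregation over steps and the lowest-reward-step heuristic are assumptions, and `toy_reward` merely stands in for a learned reward model.

```python
from typing import Callable, List, Sequence

# A reasoning trace is modelled as a list of step strings (an assumption).
Trace = List[str]
# Hypothetical interface: a reward function returns one score per step.
RewardFn = Callable[[Trace], List[float]]


def trace_score(step_rewards: Sequence[float]) -> float:
    """Aggregate per-step rewards into one trace score (mean; an assumption)."""
    if not step_rewards:
        return float("-inf")
    return sum(step_rewards) / len(step_rewards)


def rerank_best_of_n(candidates: Sequence[Trace], reward_fn: RewardFn) -> int:
    """Return the index of the candidate trace with the highest score."""
    scores = [trace_score(reward_fn(c)) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def localise_failure(step_rewards: Sequence[float]) -> int:
    """Return the index of the lowest-reward step (heuristic; an assumption)."""
    return min(range(len(step_rewards)), key=step_rewards.__getitem__)


def toy_reward(trace: Trace) -> List[float]:
    """Toy stand-in for a learned reward: penalise steps containing 'guess'."""
    return [0.0 if "guess" in step else 1.0 for step in trace]


candidates = [
    ["2 + 2 = 4", "guess: the answer is 5"],
    ["2 + 2 = 4", "4 + 1 = 5"],
]
best_index = rerank_best_of_n(candidates, toy_reward)
weakest_step = localise_failure(toy_reward(candidates[0]))
```

With a real reward model, `reward_fn` would wrap a forward pass over the trace; other aggregations (for example the minimum or the sum of step rewards) fit the same interface.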
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Scientific Reasoning | MMLU-Pro | Pass@1 | 55.6 | 32 |
| Clinical Reasoning | MEDREASON | Pass@1 | 73.1 | 15 |
| Mathematical Reasoning | AIME 2025 | Reward-weighted Pass@1 | 4.3 | 10 |
| Mathematical Reasoning | AIME 2024 | Reward-weighted Pass@1 | 3.43 | 10 |
| Mathematical Reasoning | GSM8K | Random Baseline | 90.4 | 9 |
| General Knowledge Reasoning | MMLU-Pro | Best-of-16 Delta | 9.3 | 6 |
| Mathematical Reasoning | GSM8K | Best-of-16 Delta | 13 | 6 |
| Medical Reasoning | MEDREASON | Best-of-16 Delta (Δ) | 16.4 | 6 |