Learning Reasoning Rewards from Expert Demonstrations with Inverse Reinforcement Learning

About

Teaching large language models (LLMs) to reason during post-training typically relies on reinforcement learning with explicit outcome- or process-based reward functions. However, in many real-world settings, obtaining or defining such reward functions is difficult, especially for complex tasks, making learning from expert demonstrations an attractive alternative. The dominant approach, supervised fine-tuning (SFT), trains models to imitate expert reasoning traces directly, but suffers from the general limitations of off-policy learning: performance can be fragile when inference-time states deviate from those explicitly covered by the demonstrations. To address this, we propose Reasoning Adversarial Inverse Reinforcement Learning (R-AIRL). Rather than imitating the expert's reasoning, R-AIRL infers the underlying process-level reward from the expert's chains of thought. Through experiments on GSM8K, MMLU-Pro, and MedReason, we show that the reasoning reward function learned with R-AIRL can be used effectively throughout the training and inference pipeline: (1) to provide a training signal for post-training, outperforming SFT in most of the considered settings; (2) for inference-time reranking, improving pass@1 by up to 17.4 points; and (3) for process-level evaluation, localising reasoning failures with up to 86.1% accuracy. Overall, R-AIRL bridges imitation learning and reward-based optimisation, enabling the extraction of meaningful reasoning signals from expert thinking traces.
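The inference-time uses described above (reranking sampled solutions and localising a failing step) can be illustrated with a minimal sketch. It assumes a learned reward is available as a function that scores one reasoning step at a time. The names `trace_score`, `rerank_best_of_n`, `first_low_reward_step`, and `toy_step_reward`, the mean aggregation, and the threshold are all hypothetical choices for illustration and are not taken from the paper.

```python
from typing import Callable, List, Optional, Sequence

StepReward = Callable[[str], float]  # hypothetical interface: one reasoning step -> scalar reward


def trace_score(steps: Sequence[str], step_reward: StepReward) -> float:
    """Aggregate per-step rewards into one trace-level score.

    Mean aggregation is an assumption; a sum or minimum would also be plausible.
    """
    if not steps:
        return float("-inf")
    return sum(step_reward(s) for s in steps) / len(steps)


def rerank_best_of_n(candidates: List[List[str]], step_reward: StepReward) -> int:
    """Return the index of the candidate trace with the highest aggregated reward."""
    return max(range(len(candidates)),
               key=lambda i: trace_score(candidates[i], step_reward))


def first_low_reward_step(steps: Sequence[str], step_reward: StepReward,
                          threshold: float = 0.0) -> Optional[int]:
    """Return the index of the first step scored below `threshold`, or None."""
    for i, step in enumerate(steps):
        if step_reward(step) < threshold:
            return i
    return None


def toy_step_reward(step: str) -> float:
    """Stand-in for a learned reward model: penalises steps marked with '??'."""
    return -1.0 if "??" in step else 1.0


candidates = [
    ["2 + 2 = 5 ??", "so the answer is 5"],
    ["2 + 2 = 4", "so the answer is 4"],
]
best = rerank_best_of_n(candidates, toy_step_reward)
failure = first_low_reward_step(candidates[0], toy_step_reward)
```

In this toy setup, `best` selects the second candidate and `failure` points at the first step of the first candidate. With a real reward model, `toy_step_reward` would be replaced by a call to that model.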

Claudio Fanconi, Nicolás Astorga, Mihaela van der Schaar • 2025

Related benchmarks

Task	Dataset	Metric	Result
Scientific Reasoning	MMLU-Pro	Pass@1	55.6	32
Clinical Reasoning	MedReason	Pass@1	73.1	15
Mathematical Reasoning	AIME 2025	Reward-weighted Pass@1	4.3	10
Mathematical Reasoning	AIME 2024	Reward-weighted Pass@1	3.43	10
Mathematical Reasoning	GSM8K	Random Baseline	90.4	9
General Knowledge Reasoning	MMLU-Pro	Best-of-16 Delta	9.3	6
Mathematical Reasoning	GSM8K	Best-of-16 Delta	13	6
Medical Reasoning	MedReason	Best-of-16 Delta (Δ)	16.4	6
