Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization
About
Offline reinforcement learning (RL) is a variant of RL where the policy is learned from a previously collected dataset of trajectories and rewards. In our work, we propose a practical approach to offline RL with large language models (LLMs). We recast the problem as reward-weighted fine-tuning, which can be solved using techniques similar to supervised fine-tuning (SFT). To showcase the value of our approach, we apply it to learning short-horizon, fixed-length question-answering policies, where the agent reasons about potential answers or asks clarifying questions. Our work stands in stark contrast to state-of-the-art methods in this domain, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize for rewards. We compare to them empirically, and report major gains in both optimized rewards and language quality.
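The core idea above, weighting the SFT objective by each trajectory's reward, can be illustrated with a minimal sketch. This is not the paper's code: the function name, tensor shapes, and the assumption of a single scalar reward per trajectory are all illustrative.

```python
import numpy as np

def reward_weighted_nll(logits, tokens, rewards):
    """Reward-weighted negative log-likelihood (illustrative sketch).

    logits:  (B, T, V) unnormalized token scores from the policy LLM
    tokens:  (B, T)    observed response tokens from the offline dataset
    rewards: (B,)      scalar reward for each trajectory
    """
    # log-softmax over the vocabulary axis
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    # gather the log-probability of each observed token
    B, T = tokens.shape
    tok_logp = logp[np.arange(B)[:, None], np.arange(T)[None, :], tokens]
    # plain SFT would minimize -tok_logp.mean(); here each trajectory's
    # log-likelihood is scaled by its reward, so high-reward responses
    # contribute more to the gradient than low-reward ones
    return -(rewards * tok_logp.sum(-1)).mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 8))
tokens = rng.integers(0, 8, size=(2, 4))
loss = reward_weighted_nll(logits, tokens, np.array([1.0, 0.2]))
```

With non-negative rewards this reduces to SFT when all rewards equal 1, which is one way to see why it can reuse standard SFT machinery.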
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Science Question Answering | ScienceQA (test) | Average Accuracy | 95.02 | 208 |
| Conversational SQL | CoSQL | Accuracy | 65.83 | 14 |
| Scientific Question Answering | SciQA | Accuracy | 92.48 | 13 |
| Reasoning Question Answering | ARC | Accuracy | 79.93 | 7 |
| Science Question Answering | OpenBookQA | Accuracy | 68.14 | 7 |
| Mathematical Dialogue Evaluation | MathDial (test) | Accuracy | 9.67 | 7 |
| Clarifying Questions | SciQA (test) | Accuracy | 26 | 6 |
| Clarifying Questions | OpenBookQA (test) | Accuracy | 28 | 6 |