Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization
About
Offline reinforcement learning (RL) is a variant of RL where the policy is learned from a previously collected dataset of trajectories and rewards. In our work, we propose a practical approach to offline RL with large language models (LLMs). We recast the problem as reward-weighted fine-tuning, which can be solved using techniques similar to supervised fine-tuning (SFT). To showcase the value of our approach, we apply it to learning short-horizon, fixed-length question-answering policies, where the agent reasons about potential answers or asks clarifying questions. Our work stands in stark contrast to state-of-the-art methods in this domain, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize for rewards. We compare to them empirically, and report major gains in both optimized rewards and language quality.
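The core idea above, weighting the SFT objective by each trajectory's reward, can be illustrated with a minimal sketch. This is not the paper's code: the function name, tensor shapes, and the assumption of a single scalar reward per trajectory are all illustrative.

```python
import numpy as np

def reward_weighted_nll(logits, tokens, rewards):
    """Reward-weighted negative log-likelihood (illustrative sketch).

    logits:  (B, T, V) unnormalized token scores from the policy LLM
    tokens:  (B, T)    observed response tokens from the offline dataset
    rewards: (B,)      scalar reward for each trajectory
    """
    # log-softmax over the vocabulary axis
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    # gather the log-probability of each observed token
    B, T = tokens.shape
    tok_logp = logp[np.arange(B)[:, None], np.arange(T)[None, :], tokens]
    # plain SFT would minimize -tok_logp.mean(); here each trajectory's
    # log-likelihood is scaled by its reward, so high-reward responses
    # contribute more to the gradient than low-reward ones
    return -(rewards * tok_logp.sum(-1)).mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 8))
tokens = rng.integers(0, 8, size=(2, 4))
loss = reward_weighted_nll(logits, tokens, np.array([1.0, 0.2]))
```

With non-negative rewards this reduces to SFT when all rewards equal 1, which is one way to see why it can reuse standard SFT machinery.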
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Science Question Answering | ScienceQA (test) | Average Accuracy | 95.02 | 208 |
| Conversational SQL | CoSQL | Accuracy | 65.83 | 14 |
| Scientific Question Answering | SciQA | Accuracy | 92.48 | 13 |
| Reasoning Question Answering | ARC | Accuracy | 79.93 | 7 |
| Science Question Answering | OpenBookQA | Accuracy | 68.14 | 7 |
| Mathematical Dialogue Evaluation | MathDial (test) | Accuracy | 9.67 | 7 |
| Clarifying Questions | SciQA (test) | Accuracy | 26 | 6 |
| Clarifying Questions | OpenBookQA (test) | Accuracy | 28 | 6 |