Data-dependent Exploration for Online Reinforcement Learning from Human Feedback
About
Online reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs) by continuously collecting new preference feedback during training. A foundational challenge in this setting is exploration: the algorithm must lead the LLM to generate informative comparisons that improve sample efficiency. Existing exploration strategies often derive bonuses via on-policy expectations, which are difficult to estimate reliably from the limited historical preference data available during training; as a result, the policy can prematurely down-weight under-explored regions that may contain high-value behaviors. In this paper, we propose data-dependent exploration for preference optimization (DEPO), a simple and scalable method that uses historical data to construct an additional uncertainty bonus for high-uncertainty regions, encouraging exploration toward potentially high-value data. Theoretically, we provide a data-dependent regret bound for the proposed algorithm, showing that it adapts to the hardness of the learning task and can be tighter than worst-case bounds. Empirically, the proposed method consistently outperforms strong baselines across benchmarks, demonstrating improved sample efficiency.
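To make the idea of a bonus built from historical data concrete, the following is a minimal sketch and not the DEPO bonus itself, whose exact form is not given in this abstract. It assumes a simple count-based stand-in: responses are grouped into hypothetical cluster IDs, and a response's bonus shrinks as its cluster appears in more past comparisons. The names `uncertainty_bonus`, `select_pair`, and `beta`, along with the toy history and reward values, are illustrative assumptions.

```python
import math
from collections import Counter


def uncertainty_bonus(counts, key, beta=1.0):
    """Count-based stand-in bonus (an assumption, not the paper's formula):
    large for rarely compared items, shrinking as comparisons accumulate."""
    return beta / math.sqrt(1.0 + counts[key])


def select_pair(candidates, reward, counts, beta=1.0):
    """Pick one response by estimated reward alone and a second by
    estimated reward plus the data-dependent bonus."""
    exploit = max(candidates, key=lambda c: reward[c])
    rest = [c for c in candidates if c != exploit]
    explore = max(rest, key=lambda c: reward[c] + uncertainty_bonus(counts, c, beta))
    return exploit, explore


# Hypothetical history: each entry is (chosen, rejected) response-cluster IDs.
history = [("a", "b"), ("a", "b"), ("b", "a"), ("a", "c")]
counts = Counter()
for chosen, rejected in history:
    counts[chosen] += 1
    counts[rejected] += 1

# Hypothetical reward estimates; cluster "d" has never been compared.
reward = {"a": 1.0, "b": 0.6, "c": 0.5, "d": 0.4}

pair_with_bonus = select_pair(list(reward), reward, counts, beta=1.0)
pair_without_bonus = select_pair(list(reward), reward, counts, beta=0.0)
```

With the bonus switched off (`beta=0.0`), the second response is the runner-up by estimated reward, `"b"`. With `beta=1.0`, the never-compared cluster `"d"` is selected instead, which is the behavior the abstract describes as steering exploration toward under-explored regions.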
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Instruction Following | AlpacaEval 2.0 | -- | 722 |
| Multi-turn Conversation | MT-Bench | Average Score: 84.5 | 107 |
| Factuality Evaluation | TruthfulQA | -- | 103 |
| Code Generation | LiveCodeBench Hard | Pass@1: 9.7 | 26 |
| Code Generation | LiveCodeBench Medium | -- | 23 |
| Science Question Answering | GPQA | Score: 32.83 | 16 |
| Domain-specific Alignment | IID prompt set (held-out) | AvgR: -1.94 | 13 |
| Generalist Alignment | AlpacaEval Generalist alignment prompts 2.0 | Average Rating: 0.66 | 13 |
| Code Generation | LiveCodeBench Easy | Pass@1: 65.3 | 11 |