Data-dependent Exploration for Online Reinforcement Learning from Human Feedback

About

Online reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs) by continuously collecting new preference feedback during training. A foundational challenge in this setting is exploration: the algorithm must lead the LLM to generate informative comparisons that improve the sample efficiency of online RLHF. Existing exploration strategies often derive bonuses from on-policy expectations, which are difficult to estimate reliably from the limited historical preference data available during training; as a result, the policy can prematurely down-weight under-explored regions that may contain high-value behaviors. In this paper, we propose data-dependent exploration for preference optimization (DEPO), a simple and scalable method that leverages historical data to construct an extra uncertainty bonus for high-uncertainty regions, encouraging exploration toward potentially high-value data. Theoretically, we provide a data-dependent regret bound for the proposed algorithm, showing that it adapts to the hardness of the learning task itself and can be tighter than worst-case bounds in practice. Empirically, the proposed method consistently outperforms strong baselines across benchmarks, demonstrating improved sample efficiency.
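To make the idea of a data-dependent uncertainty bonus concrete, here is a minimal sketch. It is not the DEPO algorithm from the paper: it assumes a simple count-based bonus, `beta / sqrt(1 + n)`, where `n` is how often a region appears in the historical data. The names `uncertainty_bonus`, `pick_comparison_pair`, `region_of`, and `reward_estimate`, as well as the toy data, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def uncertainty_bonus(counts, region, beta=1.0):
    """Hypothetical count-based bonus: large where historical data is scarce."""
    return beta / math.sqrt(1 + counts[region])


def pick_comparison_pair(candidates, reward_estimate, counts, region_of, beta=1.0):
    """Return the two candidates with the highest optimistic score.

    The optimistic score is the estimated reward plus the uncertainty bonus.
    """
    scored = sorted(
        candidates,
        key=lambda c: reward_estimate[c]
        + uncertainty_bonus(counts, region_of[c], beta),
        reverse=True,
    )
    return scored[0], scored[1]


# Toy historical data (assumed): region "a" was sampled often, region "b" never.
history = ["a"] * 8
counts = Counter(history)  # a Counter returns 0 for unseen regions

candidates = ["resp_1", "resp_2", "resp_3"]
region_of = {"resp_1": "a", "resp_2": "a", "resp_3": "b"}
reward_estimate = {"resp_1": 0.6, "resp_2": 0.5, "resp_3": 0.2}

# With the bonus, the under-explored response is included in the comparison.
pair = pick_comparison_pair(candidates, reward_estimate, counts, region_of)

# Without the bonus (beta=0), selection is purely greedy on estimated reward.
greedy_pair = pick_comparison_pair(
    candidates, reward_estimate, counts, region_of, beta=0.0
)
```

In this toy setup the greedy selection never queries `resp_3`, while the bonus-augmented selection does, which mirrors the abstract's concern that under-explored regions can be down-weighted prematurely.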

Zhen-Yu Zhang, Yuting Tang, Jiandong Zhang, Lanjihong Ma, Masashi Sugiyama • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Instruction Following | AlpacaEval 2.0 | -- | 722 |
| Multi-turn conversation | MT-Bench | Average Score 84.5 | 107 |
| Factuality Evaluation | TruthfulQA | -- | 103 |
| Code Generation | LiveCodeBench Hard | Pass@1 9.7 | 26 |
| Code Generation | LiveCodeBench Medium | -- | 23 |
| Science Question Answering | GPQA | Score 32.83 | 16 |
| Domain-specific Alignment | IID prompt set (held-out) | AvgR -1.94 | 13 |
| Generalist Alignment | AlpacaEval Generalist alignment prompts 2.0 | Average Rating 0.66 | 13 |
| Code Generation | LiveCodeBench Easy | Pass@1 65.3 | 11 |