
General Exploratory Bonus for Optimistic Exploration in RLHF

About

Optimistic exploration is central to improving sample efficiency in reinforcement learning from human feedback (RLHF), yet existing exploratory bonus methods often fail to realize optimism. We provide a theoretical analysis showing that current formulations, under KL or $\alpha$-divergence regularization, unintentionally bias exploration toward high-probability regions of the reference model, thereby reinforcing conservative behavior instead of promoting discovery of uncertain regions. To address this pitfall, we introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle. GEB counteracts divergence-induced bias via reference-dependent reward regulation and unifies prior heuristic bonuses as special cases, while extending naturally across the full $\alpha$-divergence family. Empirically, GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones. These results demonstrate that GEB offers both a principled and practical solution for optimistic exploration in RLHF.
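The pitfall the abstract describes can be illustrated with a toy sketch. This is not the paper's actual GEB formulation; the bonus functions below are hypothetical forms chosen only to show the qualitative contrast between a bonus that tracks the reference model's probability (reinforcing conservative behavior) and a reference-dependent bonus that grows in low-probability, uncertain regions:

```python
import math

# Illustrative sketch only: GEB's actual formulation is not reproduced here.
# We compare two hypothetical exploration bonuses for a response, given its
# log-probability under the frozen reference model.

def kl_regularized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Standard KL-regularized RLHF objective for a single response."""
    return reward - beta * (logp_policy - logp_ref)

def naive_bonus(logp_ref, scale=0.05):
    """A bonus proportional to reference probability: it is largest in
    regions the reference model already favors (the pitfall)."""
    return scale * math.exp(logp_ref)

def reference_dependent_bonus(logp_ref, scale=0.05):
    """A bonus that grows as the reference probability shrinks, rewarding
    exploration of uncertain regions (hypothetical form)."""
    return -scale * logp_ref

# Two candidate responses: one common under the reference, one rare.
for name, logp_ref in [("common", math.log(0.6)), ("rare", math.log(0.01))]:
    print(f"{name}: naive={naive_bonus(logp_ref):.4f}, "
          f"ref-dependent={reference_dependent_bonus(logp_ref):.4f}")
```

Under the naive bonus the common response receives the larger incentive, while the reference-dependent bonus reverses that ordering, pushing the policy toward under-explored regions as optimism requires.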

Wendi Li, Changdae Oh, Sharon Li • 2025

Related benchmarks

| Task | Dataset | Metric | Score | Rank |
| --- | --- | --- | --- | --- |
| RLHF Alignment | UltraFeedback In-domain v1 (test) | Win Rate | 81 | 46 |
| Mathematics Reasoning | AIME 2025 | Pass@16 | 29.48 | 10 |
| Math Reasoning | MATH500 | Pass@16 | 93 | 9 |
| LLM Alignment | UltraFeedback (in-domain) | Win Rate (KL, α=1) | 80.6 | 8 |
| Instruction Following | AlpacaEval (OOD) | KL Div (α=1) | 24.9 | 5 |
| Mathematical Reasoning | MATH OOD | KL Div (α=1) | 71 | 5 |
| Math Reasoning | OlympiadBench | Pass@16 | 65.78 | 4 |
