
Trust Regions Sell, But Who's Buying? Overlap Geometry as an Alternative Trust Region for Policy Optimization

About

Standard trust-region methods constrain policy updates via Kullback-Leibler (KL) divergence. However, KL controls only an average divergence and does not directly prevent rare, large likelihood-ratio excursions that destabilize training, which is precisely the failure mode that motivates heuristics such as PPO's clipping. We propose overlap geometry as an alternative trust region, constraining distributional overlap via the Bhattacharyya coefficient (closely related to the Hellinger/Rényi-1/2 geometry). This objective penalizes separation in the ratio tails, yielding tighter control over likelihood-ratio excursions without relying on total variation bounds that can be loose in tail regimes. We derive Bhattacharyya-TRPO (BTRPO) and Bhattacharyya-PPO (BPPO), which enforce overlap constraints via square-root ratio updates: BPPO clips the square-root likelihood ratio q = sqrt(r), and BTRPO applies a quadratic Hellinger/Bhattacharyya penalty. Empirically, overlap-based updates improve robustness and aggregate performance, as measured by RLiable under matched training budgets, suggesting overlap constraints as a practical, principled alternative to KL for stable policy optimization.
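To make the square-root-ratio view concrete, here is a minimal Python sketch for discrete action distributions. All function names are hypothetical, and the quadratic penalty E_old[(q - 1)^2] is only one natural reading of "quadratic Hellinger/Bhattacharyya penalty"; it is assumed here, not taken from the paper's BTRPO objective. The sketch shows the identity E_old[(q - 1)^2] = 2(1 - BC), which holds when both policies share the same support. It also shows the clip on q that the abstract attributes to BPPO. How BPPO combines the clipped q with advantages is not specified in the abstract and is not shown.

```python
import math


def bhattacharyya_coefficient(p_old, p_new):
    """BC(p_old, p_new) = sum_a sqrt(p_old[a] * p_new[a]) for discrete distributions."""
    return sum(math.sqrt(po * pn) for po, pn in zip(p_old, p_new))


def sqrt_ratio_penalty(p_old, p_new):
    """Hypothetical quadratic penalty E_{a ~ p_old}[(q - 1)^2], with q = sqrt(r), r = p_new/p_old.

    Assumes p_new[a] > 0 only where p_old[a] > 0 (shared support); under that
    assumption the penalty equals 2 * (1 - BC), i.e. twice the squared Hellinger distance.
    """
    total = 0.0
    for po, pn in zip(p_old, p_new):
        if po > 0.0:
            q = math.sqrt(pn / po)  # square-root likelihood ratio
            total += po * (q - 1.0) ** 2
    return total


def clip_sqrt_ratio(r, eps):
    """Clip q = sqrt(r) to [1 - eps, 1 + eps].

    Equivalent to clipping r itself to [(1 - eps)^2, (1 + eps)^2].
    """
    q = math.sqrt(r)
    return min(max(q, 1.0 - eps), 1.0 + eps)


if __name__ == "__main__":
    p_old = [0.7, 0.2, 0.1]
    p_new = [0.6, 0.25, 0.15]
    bc = bhattacharyya_coefficient(p_old, p_new)
    print("BC:", bc)
    print("E_old[(q - 1)^2]:", sqrt_ratio_penalty(p_old, p_new))
    print("2 * (1 - BC):", 2.0 * (1.0 - bc))
    print("clipped q for r = 1.44, eps = 0.1:", clip_sqrt_ratio(1.44, 0.1))
```

One design note: clipping q to [1 - eps, 1 + eps] corresponds to an r-interval of [(1 - eps)^2, (1 + eps)^2]. This interval is asymmetric around 1, extending 2*eps + eps^2 above 1 and 2*eps - eps^2 below 1, unlike PPO's symmetric [1 - eps, 1 + eps] clip on r.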

Gaurish Trivedi, Alakh Sharma, Kartikey Singh Bhandari, Yash Sinha, Pratik Narang, Dhruv Kumar, Jagat Sesh Challa• 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
--- | --- | --- | --- | ---
Continuous Control | Mujoco | Ant-v5 | 1.29e+3 | 9
Jumper | Procgen easy | Mean Episode Return | 5.87 | 4
CoinRun | Procgen easy | Mean Episode Return | 8.53 | 4
Heist | Procgen easy | Mean Episode Return | 1.87 | 4
Continuous Control | DeepMind Control suite | Cartpole Swingup IQM | 799.4 | 4
Ninja | Procgen easy | Mean Episode Return | 6.27 | 4
