General Flexible $f$-divergence for Challenging Offline RL Datasets with Low Stochasticity and Diverse Behavior Policies

About

Offline RL algorithms aim to improve upon the behavior policy that produces the collected data while constraining the learned policy to be within the support of the dataset. However, practical offline datasets often contain examples with little diversity or limited exploration of the environment, and come from multiple behavior policies with diverse expertise levels. Limited exploration can impair the offline RL algorithm's ability to estimate $Q$ or $V$ values, while constraining towards diverse behavior policies can be overly conservative. Such datasets call for a balance between the RL objective and behavior policy constraints. We first identify the connection between the $f$-divergence and an optimization constraint on the Bellman residual through a more general Linear Programming (LP) form for RL and the convex conjugate. Following this, we introduce the general flexible function formulation for the $f$-divergence to incorporate an adaptive constraint on algorithms' learning objectives based on the offline training dataset. Results from experiments on the MuJoCo, Fetch, and AdroitHand environments show the correctness of the proposed LP form and the potential of the flexible $f$-divergence in improving performance for learning from a challenging dataset when applied to a compatible constrained optimization algorithm.
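For readers unfamiliar with the two ingredients the abstract names, the $f$-divergence and the convex conjugate, the following is a minimal sketch of their standard textbook definitions for discrete distributions. It is not the paper's flexible formulation or its LP form. The function names (`f_divergence`, `kl_generator`, `chi2_generator`, `kl_conjugate`) and the example distributions `p` and `q` are assumptions made up for illustration.

```python
import math

def f_divergence(p, q, f):
    """Textbook D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)), assuming q(x) > 0 everywhere."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def kl_generator(t):
    # f(t) = t log t yields the KL divergence; convention 0 log 0 = 0.
    return t * math.log(t) if t > 0 else 0.0

def chi2_generator(t):
    # f(t) = (t - 1)^2 yields the Pearson chi-square divergence.
    return (t - 1.0) ** 2

def kl_conjugate(y):
    # Convex conjugate of f(t) = t log t: f*(y) = sup_t (t*y - f(t)) = exp(y - 1).
    return math.exp(y - 1.0)

# Hypothetical distributions over three outcomes, for illustration only.
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

kl = f_divergence(p, q, kl_generator)
chi2 = f_divergence(p, q, chi2_generator)
print(f"KL = {kl:.4f}, chi-square = {chi2:.4f}")
```

Swapping the generator `f` changes how strongly deviations of `p` from `q` are penalized. That choice of generator is the degree of freedom the abstract's "flexible" formulation refers to, although how the paper parameterizes it is not shown here.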

Jianxun Wang, Grant C. Forbes, Leonardo Villalobos-Arias, David L. Roberts • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Continuous Control | MuJoCo Ant v4 | Normalized Return: 136 | 24 |
| Continuous Control | MuJoCo Walker2d v4 | Normalized Performance: 1.30e+4 | 24 |
| Hand Manipulation | Adroit pen-human | Normalized Average Score: 53.4 | 19 |
| Offline Reinforcement Learning | D4RL Adroit hammer-cloned v0 | -- | 12 |
| Continuous Control | MuJoCo Hopper 4-p v4 | Normalized Return: 99 | 6 |
| Continuous Control | MuJoCo Hopper 2-p v4 | Normalized Return: 106 | 6 |
| Continuous Control | MuJoCo Hopper 10-p v4 | Normalized Return: 94.5 | 6 |
| Continuous Control | MuJoCo Walker2d 4-p v4 | Normalized Return: 94.2 | 6 |
| Continuous Control | MuJoCo Walker2d 10-p v4 | Normalized Return: 102 | 6 |
| Continuous Control | MuJoCo Ant 2-p v4 | Normalized Return: 146.1 | 6 |
Showing 10 of 44 rows
