Beyond Reward: Offline Preference-guided Policy Optimization
About
This study focuses on offline preference-based reinforcement learning (PbRL), a variant of conventional reinforcement learning that dispenses with the need for online interaction or the specification of a reward function. Instead, the agent is provided with fixed offline trajectories and human preferences between pairs of trajectories, from which it extracts the dynamics and the task information, respectively. Since the dynamics and task information are orthogonal, a naive approach would use preference-based reward learning followed by an off-the-shelf offline RL algorithm. However, this requires separately learning a scalar reward function, which can act as an information bottleneck in the learning process. To address this issue, we propose the offline preference-guided policy optimization (OPPO) paradigm, which models offline trajectories and preferences in a one-step process, eliminating the need for a separately learned reward function. OPPO achieves this by introducing an offline hindsight information matching objective for optimizing a contextual policy and a preference modeling objective for finding the optimal context. By optimizing these two objectives iteratively, OPPO yields a well-performing decision policy. Our empirical results demonstrate that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms run with either true or learned (pseudo) reward functions. Our code is available on the project website: https://sites.google.com/view/oppo-icml-2023 .
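To make the preference modeling objective concrete, the sketch below illustrates one plausible form of it: the optimal context `z` is optimized so that, under a Bradley-Terry model over (hypothetical) similarity scores between `z` and trajectory embeddings, the preferred trajectory is more likely to win. The score function (negative squared distance), the embeddings, and the learning rate are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def preference_loss(z, e_plus, e_minus):
    """Bradley-Terry negative log-likelihood that the preferred trajectory wins.

    z        : candidate optimal context (learnable)
    e_plus   : embedding of the preferred trajectory (assumed given)
    e_minus  : embedding of the less-preferred trajectory (assumed given)
    Score is negative squared distance to z -- an illustrative choice.
    """
    s_p = -np.sum((z - e_plus) ** 2)
    s_m = -np.sum((z - e_minus) ** 2)
    return np.log1p(np.exp(s_m - s_p))

def update_context(z, e_plus, e_minus, lr=0.1):
    """One gradient step on the preference loss w.r.t. the context z."""
    s_p = -np.sum((z - e_plus) ** 2)
    s_m = -np.sum((z - e_minus) ** 2)
    # sigmoid(s_m - s_p): probability the model currently assigns to the
    # less-preferred trajectory winning
    p_m = 1.0 / (1.0 + np.exp(s_p - s_m))
    # analytic gradient of the loss above w.r.t. z
    grad = p_m * (-2.0 * (z - e_minus) + 2.0 * (z - e_plus))
    return z - lr * grad
```

Each update pulls `z` toward the embedding of the preferred trajectory and away from the less-preferred one, which matches the paper's high-level description of searching for the optimal context from preference feedback; the actual objective in OPPO is defined over its learned hindsight embeddings.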
Related benchmarks
| Task | Dataset | Normalized Score | Rank |
|---|---|---|---|
| Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | 89.6 | 117 |
| Offline Reinforcement Learning | D4RL hopper-medium-expert | 108 | 115 |
| Offline Reinforcement Learning | D4RL walker2d-medium-expert | 105 | 86 |
| Offline Reinforcement Learning | D4RL hopper-medium-replay | 88.9 | 72 |
| Offline Reinforcement Learning | D4RL halfcheetah-medium-replay | 39.8 | 59 |
| Offline Reinforcement Learning | D4RL halfcheetah-medium | 43.4 | 59 |
| Offline Reinforcement Learning | D4RL walker2d-medium-replay | 71.7 | 45 |