Listwise Reward Estimation for Offline Preference-based Reinforcement Learning

About

In Reinforcement Learning (RL), designing precise reward functions remains a challenge, particularly when aligning with human intent. Preference-based RL (PbRL) was introduced to address this problem by learning reward models from human feedback. However, existing PbRL methods often overlook second-order preference, which indicates the relative strength of preference. In this paper, we propose Listwise Reward Estimation (LiRE), a novel approach for offline PbRL that leverages second-order preference information by constructing a Ranked List of Trajectories (RLT), which can be built efficiently using the same ternary feedback type as traditional methods. To validate the effectiveness of LiRE, we propose a new offline PbRL dataset that objectively reflects the effect of the estimated rewards. Our extensive experiments on the dataset demonstrate the superiority of LiRE: it outperforms state-of-the-art baselines even with modest feedback budgets and is robust to both the amount of feedback and feedback noise. Our code is available at https://github.com/chwoong/LiRE
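
The abstract does not spell out how an RLT is built from ternary feedback. As a rough illustration only, the sketch below assumes one plausible construction: each new trajectory is placed by binary search over the existing groups, and is compared against one representative per group. The names `build_ranked_list` and `toy_oracle` are hypothetical and are not taken from the LiRE codebase; scalar scores stand in for trajectories, and the tolerance-based tie rule is an assumption made so the example runs without a human labeler.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def build_ranked_list(
    items: Sequence[T], compare: Callable[[T, T], int]
) -> List[List[T]]:
    """Group items into a ranked list, least preferred group first.

    `compare(a, b)` is a hypothetical ternary oracle returning
    -1 (a is less preferred), 0 (equally preferred), or +1 (a is preferred).
    """
    groups: List[List[T]] = []
    for item in items:
        lo, hi = 0, len(groups)
        placed = False
        # Binary search over groups, querying the first member of each group.
        while lo < hi:
            mid = (lo + hi) // 2
            verdict = compare(item, groups[mid][0])
            if verdict == 0:
                groups[mid].append(item)  # tie: join the existing group
                placed = True
                break
            if verdict < 0:
                hi = mid
            else:
                lo = mid + 1
        if not placed:
            groups.insert(lo, [item])  # no tie found: open a new group
    return groups


def toy_oracle(a: float, b: float, tol: float = 0.5) -> int:
    """Stand-in for a labeler (assumption): scores within `tol` count as a tie."""
    if abs(a - b) <= tol:
        return 0
    return 1 if a > b else -1


ranked = build_ranked_list([3.0, 1.0, 3.2, 2.0, 0.9], toy_oracle)
print(ranked)  # [[1.0, 0.9], [2.0], [3.0, 3.2]]
```

In a list like this, items several groups apart carry a stronger implied preference than items in adjacent groups, which is the kind of second-order information a single pairwise label cannot express. Note that a tolerance-based tie rule is not transitive, so the grouping can depend on insertion order and on which group member is queried.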

Heewoong Choi, Sangwon Jung, Hongjoon Ahn, Taesup Moon • 2024

Related benchmarks

Task	Dataset	Result
Locomotion	D4RL walker2d-medium-expert v2	Average Online Return: 109.7	17
Locomotion	D4RL walker2d medium-replay v2	Offline Normalized Return: 71.3	16
Reinforcement Learning	DMC Cheetah	Run Score: 109.1	13
Reinforcement Learning	DMC PointMass	Top Left Score: 160.7	13
Reinforcement Learning	DMC Quadruped	Run Score: 246.8	13
Reinforcement Learning	DMC Walker	Walk Score: 199.1	13
Manipulation	D4RL Adroit pen (human)	Normalized Score: 4.5	12
Locomotion	D4RL Hopper-medium-expert v2	Return: 106.3	11
Robotic Manipulation	MetaWorld door-open v2	Success Rate: 84	11
Robotic Manipulation	MetaWorld sweep-into v2	Success Rate: 64	11

Showing 10 of 15 rows
