Listwise Reward Estimation for Offline Preference-based Reinforcement Learning

About

In Reinforcement Learning (RL), designing precise reward functions remains a challenge, particularly when aligning with human intent. Preference-based RL (PbRL) was introduced to address this problem by learning reward models from human feedback. However, existing PbRL methods are limited because they often overlook second-order preference information, which indicates the relative strength of a preference. In this paper, we propose Listwise Reward Estimation (LiRE), a novel approach for offline PbRL that leverages second-order preference information by constructing a Ranked List of Trajectories (RLT), which can be built efficiently using the same ternary feedback type as traditional methods. To validate the effectiveness of LiRE, we propose a new offline PbRL dataset that objectively reflects the effect of the estimated rewards. Our extensive experiments on the dataset demonstrate the superiority of LiRE: it outperforms state-of-the-art baselines even with modest feedback budgets and is robust to both the amount of feedback and feedback noise. Our code is available at https://github.com/chwoong/LiRE
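
The abstract does not spell out how the RLT is constructed, so the following is only a minimal sketch of one plausible procedure: inserting each new trajectory into an ordered list of tie groups by binary search, where every comparison is a single ternary query (preferred, not preferred, or equally preferable). The function names, the `compare` callback, and the toy integer "trajectories" are all hypothetical and are not taken from the LiRE repository.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

# Hypothetical ternary feedback: +1 if a is preferred over b,
# -1 if b is preferred over a, 0 if they are judged equally preferable.
Compare = Callable[[T, T], int]


def insert_into_ranked_list(groups: List[List[T]], item: T, compare: Compare) -> None:
    """Insert `item` into `groups`, which is ordered from least to most preferred.

    Each group holds items judged equally preferable. The item is compared
    against one representative per probed group, using binary search.
    """
    lo, hi = 0, len(groups)
    while lo < hi:
        mid = (lo + hi) // 2
        feedback = compare(item, groups[mid][0])
        if feedback == 0:
            groups[mid].append(item)  # tie: join the existing group
            return
        if feedback > 0:
            lo = mid + 1  # item is better: search the upper half
        else:
            hi = mid      # item is worse: search the lower half
    groups.insert(lo, [item])  # no tie found: open a new group


def build_ranked_list(items: List[T], compare: Compare) -> List[List[T]]:
    """Build a ranked list of tie groups from ternary comparisons only."""
    groups: List[List[T]] = []
    for item in items:
        insert_into_ranked_list(groups, item, compare)
    return groups


# Toy usage: integers stand in for trajectories, and the simulated labeler
# compares them directly (an assumption made purely for illustration).
def toy_labeler(a: int, b: int) -> int:
    return (a > b) - (a < b)


ranked = build_ranked_list([3, 1, 2, 3, 0, 1], toy_labeler)
print(ranked)  # [[0], [1, 1], [2], [3, 3]]
```

Under these assumptions, each insertion needs roughly logarithmically many queries in the number of groups, and the resulting list orders every pair of trajectories, not just the pairs that were queried directly.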

Heewoong Choi, Sangwon Jung, Hongjoon Ahn, Taesup Moon • 2024

Related benchmarks

Task | Dataset | Result | Rank
Reinforcement Learning | DMC Cheetah | Run Score: 109.1 | 13
Reinforcement Learning | DMC PointMass | Top Left Score: 160.7 | 13
Reinforcement Learning | DMC Quadruped | Run Score: 246.8 | 13
Reinforcement Learning | DMC Walker | Walk Score: 199.1 | 13
Manipulation | D4RL Adroit pen (human) | Normalized Score: 4.5 | 12
Robot Manipulation | Meta-World | Success Rate (lever-pull): 51.2 | 5
Manipulation | MetaWorld Button-Press-Topdown (human-labeled) | Performance: 70 | 3
Manipulation | Adroit Pen-cloned (human-labeled) | Performance: 14.4 | 3
