Preference Transformer: Modeling Human Preferences using Transformers for RL

About

Preference-based reinforcement learning (RL) provides a framework to train agents using human preferences between two behaviors. However, preference-based RL has been challenging to scale since it requires a large amount of human feedback to learn a reward function aligned with human intent. In this paper, we present Preference Transformer, a neural architecture that models human preferences using transformers. Unlike prior approaches, which assume that human judgment is based on Markovian rewards that contribute equally to the decision, we introduce a new preference model based on the weighted sum of non-Markovian rewards. We then design the proposed preference model using a transformer architecture that stacks causal and bidirectional self-attention layers. We demonstrate that Preference Transformer can solve a variety of control tasks using real human preferences, while prior approaches fail to work. We also show that Preference Transformer can induce a well-specified reward and attend to critical events in the trajectory by automatically capturing the temporal dependencies in human decision-making. Code is available on the project website: https://sites.google.com/view/preference-transformer.
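The weighted-sum preference model described in the abstract can be illustrated with a minimal sketch. It assumes a Bradley-Terry-style comparison of two trajectory segments, where each segment's score is the sum of per-step rewards multiplied by per-step importance weights. The function names and the example numbers are hypothetical, and the rewards and weights are passed in as plain lists rather than produced by a transformer, so this is not the authors' implementation.

```python
import math


def weighted_return(rewards, weights):
    """Weighted sum of per-step rewards for one trajectory segment."""
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    return sum(w * r for w, r in zip(weights, rewards))


def preference_probability(rewards_a, weights_a, rewards_b, weights_b):
    """Probability that segment A is preferred over segment B.

    Assumed form (Bradley-Terry style):
        exp(score_a) / (exp(score_a) + exp(score_b))
    computed as a numerically stable sigmoid of the score difference.
    """
    diff = weighted_return(rewards_a, weights_a) - weighted_return(rewards_b, weights_b)
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


# Hypothetical per-step rewards for two three-step segments.
rewards_a = [0.0, 1.0, 0.0]  # one critical event in the middle
rewards_b = [0.5, 0.0, 0.5]  # reward spread over the first and last steps

# Equal weights: every step contributes equally to the decision.
uniform = [1.0 / 3.0] * 3
p_uniform = preference_probability(rewards_a, uniform, rewards_b, uniform)

# Non-uniform weights that emphasize the middle step.
focused = [0.1, 0.8, 0.1]
p_focused = preference_probability(rewards_a, focused, rewards_b, focused)

print(p_uniform, p_focused)
```

With equal weights the two segments have the same total score and neither is preferred. Once the weights emphasize the middle step, segment A becomes more likely to be preferred, which is the kind of sensitivity to critical events that the abstract attributes to the weighted model.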

Changyeon Kim, Jongjin Park, Jinwoo Shin, Honglak Lee, Pieter Abbeel, Kimin Lee • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | Normalized Score: 86.8 | 169 |
| Offline Reinforcement Learning | D4RL hopper-medium-expert | Normalized Score: 103 | 161 |
| Offline Reinforcement Learning | D4RL walker2d-medium-expert | Normalized Score: 110.4 | 132 |
| Offline Reinforcement Learning | D4RL Medium-Replay Hopper | Normalized Score: 84.54 | 109 |
| Offline Reinforcement Learning | D4RL Medium HalfCheetah | Normalized Score: 47.6 | 105 |
| Offline Reinforcement Learning | D4RL Medium-Replay HalfCheetah | Normalized Score: 42.3 | 97 |
| Offline Reinforcement Learning | Kitchen Partial | Normalized Score: 53.4 | 69 |
| Offline Reinforcement Learning | D4RL walker2d medium-replay | Normalized Score: 75.7 | 62 |
| Offline Reinforcement Learning | D4RL Adroit pen (human) | Normalized Return: 53 | 53 |
| Offline Reinforcement Learning | D4RL Adroit pen (cloned) | Normalized Return: 42.9 | 53 |
Showing 10 of 36 rows
