
Preference Transformer: Modeling Human Preferences using Transformers for RL

About

Preference-based reinforcement learning (RL) provides a framework for training agents using human preferences between two behaviors. However, preference-based RL has been challenging to scale, since it requires a large amount of human feedback to learn a reward function aligned with human intent. In this paper, we present Preference Transformer, a neural architecture that models human preferences using transformers. Unlike prior approaches, which assume human judgment is based on Markovian rewards that contribute equally to the decision, we introduce a new preference model based on a weighted sum of non-Markovian rewards. We design the proposed preference model using a transformer architecture that stacks causal and bidirectional self-attention layers. We demonstrate that Preference Transformer can solve a variety of control tasks using real human preferences, where prior approaches fail. We also show that Preference Transformer can induce a well-specified reward and attend to critical events in the trajectory by automatically capturing the temporal dependencies in human decision-making. Code is available on the project website: https://sites.google.com/view/preference-transformer.
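The weighted-sum preference model described in the abstract can be sketched in a few lines. This is an illustrative reduction, not the paper's implementation: in Preference Transformer, both the per-timestep rewards and the importance weights come from the transformer (the bidirectional self-attention layers produce the weights); here both are supplied by hand, and `preference_probability` is a hypothetical name. The comparison itself follows the standard Bradley–Terry form used in preference-based RL.

```python
import math

def preference_probability(rewards_a, weights_a, rewards_b, weights_b):
    """Bradley-Terry probability that segment A is preferred over segment B,
    scoring each segment by a weighted sum of its per-timestep rewards
    (the weights stand in for attention-derived importance weights)."""
    score_a = sum(w * r for w, r in zip(weights_a, rewards_a))
    score_b = sum(w * r for w, r in zip(weights_b, rewards_b))
    # Sigmoid of the score difference: equal scores give probability 0.5.
    return 1.0 / (1.0 + math.exp(score_b - score_a))

# Example: segment A has a high-reward "critical event" that the weights
# emphasize, so A is preferred with probability above 0.5.
p = preference_probability(
    rewards_a=[0.1, 2.0, 0.1], weights_a=[0.1, 0.8, 0.1],
    rewards_b=[0.5, 0.5, 0.5], weights_b=[0.33, 0.34, 0.33],
)
```

Under the paper's Markovian baseline, all weights would be equal; the non-Markovian model lets the weights concentrate on critical timesteps, which is what the attention analysis in the paper visualizes.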

Changyeon Kim, Jongjin Park, Jinwoo Shin, Honglak Lee, Pieter Abbeel, Kimin Lee • 2023

Related benchmarks

Task                           | Dataset                        | Metric           | Result | Rank
Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | Normalized Score | 86.8   | 117
Offline Reinforcement Learning | D4RL hopper-medium-expert      | Normalized Score | 103    | 115
Offline Reinforcement Learning | D4RL walker2d-medium-expert    | Normalized Score | 110.4  | 86
Offline Reinforcement Learning | D4RL Medium-Replay Hopper      | Normalized Score | 84.54  | 72
Offline Reinforcement Learning | Kitchen Partial                | Normalized Score | 53.4   | 62
Offline Reinforcement Learning | D4RL Medium HalfCheetah        | Normalized Score | 47.6   | 59
Offline Reinforcement Learning | D4RL Medium-Replay HalfCheetah | Normalized Score | 42.3   | 59
Offline Reinforcement Learning | D4RL walker2d medium-replay    | Normalized Score | 75.7   | 45
Offline Reinforcement Learning | D4RL Medium-Replay Walker2d    | Normalized Score | 77     | 34
Offline Reinforcement Learning | D4RL Adroit pen (human)        | Normalized Return| 53     | 32

(Showing 10 of 19 rows)
