Query-Policy Misalignment in Preference-Based Reinforcement Learning

About

Preference-based reinforcement learning (PbRL) provides a natural way to align RL agents' behavior with human-desired outcomes, but is often constrained by costly human feedback. To improve feedback efficiency, most existing PbRL methods focus on selecting queries to maximally improve the overall quality of the reward model, but counter-intuitively, we find that this may not necessarily lead to improved performance. To unravel this mystery, we identify a long-neglected issue in the query selection schemes of existing PbRL studies: Query-Policy Misalignment. We show that the seemingly informative queries selected to improve the overall quality of the reward model may not actually align with RL agents' interests, thus offering little help to policy learning and eventually resulting in poor feedback efficiency. We show that this issue can be effectively addressed via near on-policy queries and a specially designed hybrid experience replay, which together enforce bidirectional query-policy alignment. Simple yet elegant, our method can be easily incorporated into existing approaches by changing only a few lines of code. We showcase in comprehensive experiments that our method achieves substantial gains in both human feedback and RL sample efficiency, demonstrating the importance of addressing query-policy misalignment in PbRL tasks.
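The two ideas named in the abstract, near on-policy queries and a hybrid experience replay, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `HybridReplayBuffer`, the function `near_on_policy_queries`, and the parameters `recent_size`, `recent_fraction`, and `recent_k` are all hypothetical names chosen for illustration, and the specific mixing rule (a fixed fraction of each batch drawn from the most recent transitions) is an assumption.

```python
import random
from collections import deque


class HybridReplayBuffer:
    """Toy buffer mixing recent-only and uniform sampling (illustrative only)."""

    def __init__(self, capacity=10_000, recent_size=1_000):
        self.buffer = deque(maxlen=capacity)
        self.recent_size = recent_size  # hypothetical "near on-policy" window

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, recent_fraction=0.5, rng=random):
        # Assumed mixing rule: part of the batch comes from the newest
        # transitions, the remainder uniformly from the whole buffer.
        everything = list(self.buffer)
        recent = everything[-self.recent_size:]
        n_recent = int(batch_size * recent_fraction)
        batch = [rng.choice(recent) for _ in range(n_recent)]
        batch += [rng.choice(everything) for _ in range(batch_size - n_recent)]
        return batch


def near_on_policy_queries(trajectories, num_queries, segment_len,
                           recent_k=10, rng=random):
    """Build preference queries (segment pairs) from the newest trajectories only."""
    # Keep only recent trajectories long enough to contain a full segment.
    recent = [t for t in trajectories[-recent_k:] if len(t) >= segment_len]

    def segment():
        traj = rng.choice(recent)
        start = rng.randrange(len(traj) - segment_len + 1)
        return traj[start:start + segment_len]

    return [(segment(), segment()) for _ in range(num_queries)]
```

In this sketch, restricting queries to the newest trajectories keeps human labels focused on states the current policy actually visits, while the mixed sampling keeps policy updates concentrated on that same region without discarding older data entirely.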

Xiao Hu, Jianxiong Li, Xianyuan Zhan, Qing-Shan Jia, Ya-Qin Zhang • 2023

Related benchmarks

Task	Dataset	Metric	Result	
door-open	Meta-World	Door Open Success Rate	100	28
window-open	Meta-World window-open	ASR	40	20
window-close	Meta-World window-close	ASR	26	20
door-lock	Meta-World	Success Rate	90	14
Handle Press	Meta-World	Success Rate	80	14
door-unlock	Meta-World	Success Rate	30	14
Box Open	Meta-World sim	Box Open Success Rate	90	6
box-close	Meta-World sim	Box Close Success Rate	80	6
box-close	UR5 real	Success Rate	35	6
Door Close	Meta-World sim	Success Rate	100	6

Showing 10 of 13 rows
