
A2PO: Towards Effective Offline Reinforcement Learning from an Advantage-aware Perspective

About

Offline reinforcement learning endeavors to leverage offline datasets to craft an effective agent policy without online interaction, imposing proper conservative constraints with the support of behavior policies to tackle the out-of-distribution problem. However, existing works often suffer from a constraint-conflict issue when offline datasets are collected from multiple behavior policies, i.e., different behavior policies may exhibit inconsistent actions with distinct returns across the state space. To remedy this issue, recent advantage-weighted methods prioritize samples with high advantage values for agent training, but inevitably ignore the diversity of behavior policies. In this paper, we introduce a novel Advantage-Aware Policy Optimization (A2PO) method to explicitly construct advantage-aware policy constraints for offline learning under mixed-quality datasets. Specifically, A2PO employs a conditional variational auto-encoder to disentangle the action distributions of intertwined behavior policies by modeling the advantage values of all training data as conditional variables. The agent can then follow such disentangled action distribution constraints to optimize the advantage-aware policy toward high advantage values. Extensive experiments conducted on both single-quality and mixed-quality datasets of the D4RL benchmark demonstrate that A2PO yields results superior to its counterparts. Our code is available at https://github.com/Plankson/A2PO
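The core idea of conditioning the variational auto-encoder on advantage values can be sketched as a forward pass: the encoder sees (state, action, advantage) and the decoder reconstructs actions from (latent, state, advantage), so actions from behavior policies of different quality are disentangled by their advantage condition. The sketch below is illustrative only, not the authors' implementation; all dimensions, layer sizes, and the random (untrained) weights are assumptions.

```python
import numpy as np

# Assumed dimensions for illustration (not from the paper).
STATE_DIM, ACTION_DIM, ADV_DIM, LATENT_DIM, HIDDEN = 11, 3, 1, 2, 32
rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random weights standing in for a trained layer."""
    return rng.normal(0, 0.1, (in_dim, out_dim)), np.zeros(out_dim)

def relu(x):
    return np.maximum(x, 0.0)

# Encoder q(z | s, a, adv): the advantage value is part of the condition,
# so actions from different-quality behavior policies map to separate
# regions of the latent space.
We1 = linear(STATE_DIM + ACTION_DIM + ADV_DIM, HIDDEN)
Wmu = linear(HIDDEN, LATENT_DIM)
Wlv = linear(HIDDEN, LATENT_DIM)

# Decoder p(a | z, s, adv): conditioning on a high advantage at decision
# time asks the model for actions consistent with high-return behavior.
Wd1 = linear(LATENT_DIM + STATE_DIM + ADV_DIM, HIDDEN)
Wd2 = linear(HIDDEN, ACTION_DIM)

def encode(state, action, adv):
    h = relu(np.concatenate([state, action, adv]) @ We1[0] + We1[1])
    return h @ Wmu[0] + Wmu[1], h @ Wlv[0] + Wlv[1]  # mean, log-variance

def decode(z, state, adv):
    h = relu(np.concatenate([z, state, adv]) @ Wd1[0] + Wd1[1])
    return np.tanh(h @ Wd2[0] + Wd2[1])  # bounded action in [-1, 1]

# One forward pass on dummy data.
state = rng.normal(size=STATE_DIM)
action = rng.normal(size=ACTION_DIM)
adv = np.array([0.9])  # normalized advantage as the conditional variable

mu, logvar = encode(state, action, adv)
z = mu + np.exp(0.5 * logvar) * rng.normal(size=LATENT_DIM)  # reparameterize
recon = decode(z, state, adv)
```

In training, the usual evidence lower bound (reconstruction loss plus a KL term on the latent) would be minimized over the offline dataset; the sketch shows only how the advantage enters both the encoder and decoder as a conditional input.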

Yunpeng Qing, Shunyu Liu, Jingyuan Cong, Kaixuan Chen, Yihe Zhou, Mingli Song • 2024

Related benchmarks

| Task | Dataset | Normalized Score | Rank |
| --- | --- | --- | --- |
| Hopper locomotion | D4RL hopper medium-replay | 101.6 | 56 |
| Walker2d locomotion | D4RL walker2d medium-replay | 82.8 | 53 |
| Locomotion | D4RL walker2d-medium-expert | 112.1 | 47 |
| Locomotion | D4RL Walker2d medium | 84.9 | 44 |
| Locomotion | D4RL Halfcheetah medium | 47.1 | 44 |
| Hopper locomotion | D4RL hopper-medium-expert | 113.4 | 38 |
| Hopper locomotion | D4RL Hopper medium | 80.3 | 38 |
| Locomotion | D4RL halfcheetah-medium-expert | 95.6 | 37 |
| Locomotion | D4RL HalfCheetah Medium-Replay | 0.448 | 33 |
| Offline Reinforcement Learning | D4RL Kitchen kitchen-partial v0 (test) | 75.8 | 18 |

Showing 10 of 39 rows.

Other info

Code: https://github.com/Plankson/A2PO