
Look Beneath the Surface: Exploiting Fundamental Symmetry for Sample-Efficient Offline RL

About

Offline reinforcement learning (RL) offers an appealing approach to real-world tasks by learning policies from pre-collected datasets without interacting with the environment. However, the performance of existing offline RL algorithms heavily depends on the scale and state-action space coverage of datasets. Real-world data collection is often expensive and uncontrollable, leading to small and narrowly covered datasets and posing significant challenges for practical deployments of offline RL. In this paper, we provide a new insight that leveraging the fundamental symmetry of system dynamics can substantially enhance offline RL performance under small datasets. Specifically, we propose a Time-reversal symmetry (T-symmetry) enforced Dynamics Model (TDM), which establishes consistency between a pair of forward and reverse latent dynamics. TDM provides both well-behaved representations for small datasets and a new reliability measure for OOD samples based on compliance with the T-symmetry. These can be readily used to construct a new offline RL algorithm (TSRL) with less conservative policy constraints and a reliable latent space data augmentation procedure. Based on extensive experiments, we find TSRL achieves great performance on small benchmark datasets with as few as 1% of the original samples, which significantly outperforms the recent offline RL algorithms in terms of data efficiency and generalizability. Code is available at: https://github.com/pcheng2/TSRL
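The abstract's core idea is that a transition which obeys the learned dynamics should also be consistent with their time-reversed counterpart, and the degree of violation can serve as an OOD reliability score. The following is a minimal toy sketch of that idea, not the authors' implementation: it assumes hypothetical 1-D latent dynamics z' = z + a whose time-reversed model is z = z' - a, and measures T-symmetry compliance as the combined forward/reverse cycle error.

```python
def forward_model(z, a):
    """Toy forward latent dynamics: predict the next latent state."""
    return z + a

def reverse_model(z_next, a):
    """Toy reverse latent dynamics: recover the previous latent state
    by applying the time-reversed (negated) action."""
    return z_next - a

def t_symmetry_score(z, a, z_next):
    """T-symmetry compliance error of an observed transition (z, a, z_next).

    Small values mean the forward and reverse latent dynamics agree on
    the transition; large values flag it as unreliable / OOD.
    """
    fwd_err = abs(forward_model(z, a) - z_next)
    rev_err = abs(reverse_model(z_next, a) - z)
    return fwd_err + rev_err

# In-distribution transition: obeys the toy dynamics, so the error is 0.
in_dist = t_symmetry_score(z=1.0, a=0.5, z_next=1.5)   # -> 0.0

# Corrupted / OOD transition: both directions disagree, error is large.
ood = t_symmetry_score(z=1.0, a=0.5, z_next=3.0)       # -> 3.0
```

In TSRL this kind of compliance measure is what gates which OOD samples are trusted for the less conservative policy constraint and the latent-space data augmentation; here it is reduced to scalar arithmetic purely for illustration.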

Peng Cheng, Xianyuan Zhan, Zhihao Wu, Wenjia Zhang, Shoucheng Song, Han Wang, Youfang Lin, Li Jiang • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|------|---------|--------|--------|------|
| Hand Manipulation | Adroit door-human | Average Normalized Score | 0.6 | 33 |
| Offline Reinforcement Learning | D4RL MuJoCo Hopper-mr v2 (medium-replay) | Average Normalized Score | 78.7 | 29 |
| Offline Reinforcement Learning | D4RL MuJoCo Walker2d-mr v2 (medium-replay) | Average Normalized Score | 66.1 | 29 |
| Offline Reinforcement Learning | D4RL MuJoCo Hopper-m v2 (medium) | Average Normalized Score | 86.7 | 24 |
| Offline Reinforcement Learning | D4RL MuJoCo Walker2d medium-expert v2 | Average Normalized Score | 109.8 | 24 |
| Offline Reinforcement Learning | D4RL MuJoCo Halfcheetah-mr v2 (medium-replay) | Average Normalized Score | 42.2 | 24 |
| Hand Manipulation | Adroit door-cloned | Normalized Score | 0.1 | 23 |
| Offline Reinforcement Learning | D4RL MuJoCo Hopper-Medium-Expert v2 | Normalized Score | 95.9 | 22 |
| Offline Reinforcement Learning | D4RL AntMaze v2 (various) | UMaze Success Rate | 74.3 | 20 |
| Pen | Adroit Pen Human v0 | Normalized Score | 85.7 | 19 |
Showing 10 of 28 rows
