
Look Beneath the Surface: Exploiting Fundamental Symmetry for Sample-Efficient Offline RL

About

Offline reinforcement learning (RL) offers an appealing approach to real-world tasks by learning policies from pre-collected datasets without interacting with the environment. However, the performance of existing offline RL algorithms heavily depends on the scale and state-action space coverage of the datasets. Real-world data collection is often expensive and uncontrollable, leading to small, narrowly covered datasets and posing significant challenges for the practical deployment of offline RL. In this paper, we provide a new insight: leveraging the fundamental symmetry of system dynamics can substantially enhance offline RL performance under small datasets. Specifically, we propose a Time-reversal symmetry (T-symmetry) enforced Dynamics Model (TDM), which establishes consistency between a pair of forward and reverse latent dynamics. TDM provides both well-behaved representations for small datasets and a new reliability measure for OOD samples based on their compliance with the T-symmetry. These can be readily used to construct a new offline RL algorithm (TSRL) with less conservative policy constraints and a reliable latent space data augmentation procedure. In extensive experiments, TSRL achieves strong performance on small benchmark datasets containing as few as 1% of the original samples, significantly outperforming recent offline RL algorithms in data efficiency and generalizability. Code is available at: https://github.com/pcheng2/TSRL
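The core idea of the abstract, checking whether a transition is consistent under time reversal, can be illustrated with a toy sketch. The code below is not the paper's implementation (see the linked repository for that); it uses hypothetical linear latent dynamics and made-up names (`forward_dyn`, `reverse_dyn`, `t_symmetry_loss`) purely to show the shape of the consistency check: apply the forward model, then the reverse model, and measure how far you land from where you started. A large discrepancy flags an unreliable, OOD-like sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear latent dynamics (stand-ins for TDM's learned networks;
# shapes and parameter scales here are illustrative assumptions).
A = rng.normal(scale=0.1, size=(4, 4))   # latent transition matrix
B = rng.normal(scale=0.1, size=(4, 2))   # action-effect matrix

def forward_dyn(z, a):
    """Forward latent dynamics: predict z' from (z, a)."""
    return z + A @ z + B @ a

def reverse_dyn(z_next, a):
    """Time-reversed latent dynamics: predict z from (z', a)."""
    return z_next - A @ z_next - B @ a

def t_symmetry_loss(z, a):
    """Squared discrepancy between z and the reverse-of-forward
    reconstruction. In-distribution transitions should give a small
    value; a large value indicates poor T-symmetry compliance."""
    z_reconstructed = reverse_dyn(forward_dyn(z, a), a)
    return float(np.sum((z_reconstructed - z) ** 2))

z = rng.normal(size=4)   # a latent state
a = rng.normal(size=2)   # an action
print(t_symmetry_loss(z, a))
```

In the paper's setting both dynamics models are learned networks trained jointly with a consistency objective, and the same discrepancy score is what gates which augmented latent samples are trusted.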

Peng Cheng, Xianyuan Zhan, Zhihao Wu, Wenjia Zhang, Shoucheng Song, Han Wang, Youfang Lin, Li Jiang• 2023

Related benchmarks

Task | Dataset | Metric | Score | Rank
Offline Reinforcement Learning | D4RL MuJoCo Hopper-mr v2 (medium-replay) | Avg Normalized Score | 78.7 | 36
Hand Manipulation | Adroit door-human | Normalized Avg Score | 0.6 | 33
Offline Reinforcement Learning | D4RL MuJoCo Hopper-m v2 (medium) | Avg Normalized Score | 86.7 | 31
Offline Reinforcement Learning | D4RL Maze2D | Return (UMaze) | 76.9 | 31
Offline Reinforcement Learning | D4RL MuJoCo Walker2d medium-expert v2 | Average Normalized Score | 109.8 | 31
Offline Reinforcement Learning | D4RL MuJoCo Walker2d-mr v2 (medium-replay) | Average Normalized Score | 66.1 | 29
Offline Reinforcement Learning | D4RL MuJoCo Halfcheetah-mr v2 (medium-replay) | Avg Normalized Score | 42.2 | 24
Hand Manipulation | Adroit door-cloned | Normalized Score | 0.1 | 23
Offline Reinforcement Learning | D4RL MuJoCo Hopper-Medium-Expert v2 | Normalized Score | 95.9 | 22
Offline Reinforcement Learning | D4RL Locomotion Full datasets | Hopper Score (m) | 86.7 | 21
(Showing 10 of 28 benchmark rows.)

Other info

Code: https://github.com/pcheng2/TSRL
