
Debiased Offline Representation Learning for Fast Online Adaptation in Non-stationary Dynamics

About

Developing policies that can adjust to non-stationary environments is essential for real-world reinforcement learning applications. However, learning such adaptable policies in offline settings, with only a limited set of pre-collected trajectories, presents significant challenges. A key difficulty arises because the limited offline data makes it hard for the context encoder to differentiate between changes in the environment dynamics and shifts in the behavior policy, often leading to context misassociations. To address this issue, we introduce a novel approach called Debiased Offline Representation for fast online Adaptation (DORA). DORA incorporates an information bottleneck principle that maximizes mutual information between the dynamics encoding and the environmental data, while minimizing mutual information between the dynamics encoding and the actions of the behavior policy. We present a practical implementation of DORA, leveraging tractable bounds of the information bottleneck principle. Our experimental evaluation across six benchmark MuJoCo tasks with variable parameters demonstrates that DORA not only achieves a more precise dynamics encoding but also significantly outperforms existing baselines.
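The abstract's objective combines maximizing one mutual-information term while minimizing another, implemented via tractable bounds. As a rough, self-contained illustration of the kind of bound such objectives typically rely on (not DORA's actual implementation — the function name, batch shapes, and temperature below are illustrative assumptions), the sketch computes an InfoNCE-style lower bound on the mutual information between paired encodings:

```python
import numpy as np

def info_nce_lower_bound(z, c, temperature=0.5):
    """InfoNCE lower bound on I(Z; C) from a batch of paired samples.

    z, c: (N, d) arrays of paired encodings; row i of z and row i of c
    form a positive pair, and all other rows in the batch act as negatives.
    """
    # Normalize rows so the dot product is cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    logits = z @ c.T / temperature  # (N, N) pairwise similarity matrix
    # Row-wise log-softmax; the diagonal entries are the positive pairs.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Bound: mean positive log-probability plus log of the batch size.
    return float(np.mean(np.diag(log_probs)) + np.log(len(z)))

rng = np.random.default_rng(0)
c = rng.normal(size=(64, 8))
z_dependent = c + 0.01 * rng.normal(size=(64, 8))  # z strongly depends on c
z_independent = rng.normal(size=(64, 8))           # z independent of c

# The bound is clearly positive for dependent pairs and near zero
# (or below) for independent ones.
print(info_nce_lower_bound(z_dependent, c))
print(info_nce_lower_bound(z_independent, c))
```

In a debiasing setup like the one the abstract describes, a lower bound of this kind would be maximized for the term tying the dynamics encoding to environment transitions, while a separate upper bound would be minimized for the term tying the encoding to behavior-policy actions.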

Xinyu Zhang, Wenjie Qiu, Yi-Chen Li, Lei Yuan, Chengxing Jia, Zongzhang Zhang, Yang Yu • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Cheetah-LS | Contextual-DMC (in-distribution) | Average Return | 901.7 | 6
Hopper-mass | MuJoCo in-distribution | Average Return | 555 | 6
Offline Meta-Reinforcement Learning | MuJoCo Cheetah-LS In-distribution | Average Return | 895.3 | 6
Cheetah-speed | Contextual-DMC (in-distribution) | Average Return | 497.1 | 6
Meta-Reinforcement Learning | Meta-World in-distribution v2 (test) | Assembly Success Rate | 0.00e+0 | 6
Offline Meta-Reinforcement Learning | MuJoCo Hopper-mass In-distribution | Average Return | 563.3 | 6
Finger-LS | Contextual-DMC (in-distribution) | Average Return | 869.2 | 6
Offline Meta-Reinforcement Learning | Walker-LS (out-of-distribution) | Average Return | 650.9 | 6
Walker-friction | MuJoCo in-distribution | Average Return | 513.9 | 6
Ant-dir | MuJoCo in-distribution | Average Return | 526.9 | 6

(Showing 10 of 28 rows.)
