Context Shift Reduction for Offline Meta-Reinforcement Learning

About

Offline meta-reinforcement learning (OMRL) utilizes pre-collected offline datasets to enhance the agent's generalization ability on unseen tasks. However, the context shift problem arises due to the distribution discrepancy between the contexts used for training (from the behavior policy) and testing (from the exploration policy). The context shift problem leads to incorrect task inference and further deteriorates the generalization ability of the meta-policy. Existing OMRL methods either overlook this problem or attempt to mitigate it with additional information. In this paper, we propose a novel approach called Context Shift Reduction for OMRL (CSRO) to address the context shift problem with only offline datasets. The key insight of CSRO is to minimize the influence of policy in context during both the meta-training and meta-test phases. During meta-training, we design a max-min mutual information representation learning mechanism to diminish the impact of the behavior policy on task representation. In the meta-test phase, we introduce the non-prior context collection strategy to reduce the effect of the exploration policy. Experimental results demonstrate that CSRO significantly reduces the context shift and improves the generalization ability, surpassing previous methods across various challenging domains.
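As a rough illustration of the meta-test idea described above, the sketch below collects a context without drawing a task representation from a prior. It assumes a two-phase scheme that the abstract itself does not spell out: a few policy-independent random steps first, then steps where the task representation z is re-inferred from the context gathered so far. The function name, the toy 1-D task, the stand-in encoder, and the stand-in policy are all hypothetical and are not taken from the CSRO code.

```python
import random
from typing import Callable, List, Tuple

# (state, action, reward, next_state); scalars keep the toy example small.
Transition = Tuple[float, float, float, float]


def collect_context_non_prior(
    reset: Callable[[], float],
    step: Callable[[float, float], Tuple[float, float]],
    random_action: Callable[[], float],
    policy: Callable[[float, float], float],
    encode: Callable[[List[Transition]], float],
    warmup_steps: int,
    total_steps: int,
) -> List[Transition]:
    """Collect a meta-test context without sampling z from a prior (assumed scheme)."""
    if warmup_steps < 1:
        raise ValueError("need at least one random step before the first encode call")
    context: List[Transition] = []
    state = reset()
    for t in range(total_steps):
        if t < warmup_steps:
            # Phase 1: actions do not depend on any task belief.
            action = random_action()
        else:
            # Phase 2: re-infer z from everything collected so far, then act on it.
            z = encode(context)
            action = policy(state, z)
        next_state, reward = step(state, action)
        context.append((state, action, reward, next_state))
        state = next_state
    return context


# Toy 1-D task (hypothetical): reward is higher the closer the agent is to a hidden goal.
GOAL = 2.0
rng = random.Random(0)


def toy_reset() -> float:
    return 0.0


def toy_step(state: float, action: float) -> Tuple[float, float]:
    next_state = state + action
    return next_state, -abs(next_state - GOAL)


def toy_random_action() -> float:
    return rng.uniform(-1.0, 1.0)


def toy_encode(context: List[Transition]) -> float:
    # Stand-in for a learned context encoder: the best-rewarded position seen so far.
    return max(context, key=lambda tr: tr[2])[3]


def toy_policy(state: float, z: float) -> float:
    # Move toward the inferred position with a bounded step size.
    return max(-1.0, min(1.0, z - state))


context = collect_context_non_prior(
    toy_reset, toy_step, toy_random_action, toy_policy, toy_encode,
    warmup_steps=5, total_steps=20,
)
```

The point of the warm-up phase in this sketch is that the first transitions are gathered by actions that depend on no task belief, so the encoder's first input is not shaped by a guessed z. In a real setup the stand-in encoder and policy would be replaced by the learned context encoder and meta-policy.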

Yunkai Gao, Rui Zhang, Jiaming Guo, Fan Wu, Qi Yi, Shaohui Peng, Siming Lan, Ruizhi Chen, Zidong Du, Xing Hu, Qi Guo, Ling Li, Yunji Chen • 2023

Related benchmarks

Task	Dataset	Result
Meta-Reinforcement Learning	Hopper-Param (ID)	Average Return 257	30
Meta-Reinforcement Learning	Cheetah-Vel-Sparse (OOD)	Average Return 169	15
Offline Meta-Reinforcement Learning	Point-Robot sampled 10 unseen (test)	Average Return -6.4	10
Offline Meta-Reinforcement Learning	Half-Cheetah-Vel sampled 10 unseen (test)	Average Return -48.4	10
Offline Meta-Reinforcement Learning	Walker-Rand-Params sampled 10 unseen (test)	Average Return 344.2	10
Meta-Reinforcement Learning	Cheetah-Vel ID	Average Return 223	10
Meta-Reinforcement Learning	Cheetah-Vel	Average Return 57	10
Meta-Reinforcement Learning	Point-Robot Sparse	Average Return 13	10
Meta-Reinforcement Learning	Walker-Param (ID)	Average Return 70	10
Meta-Reinforcement Learning	Ant Dir	Average Return 141	10

