
Doubly Mild Generalization for Offline Reinforcement Learning

About

Offline Reinforcement Learning (RL) suffers from extrapolation error and value overestimation. From a generalization perspective, this issue can be attributed to the over-generalization of value functions or policies towards out-of-distribution (OOD) actions. Significant efforts have been devoted to mitigating such generalization, and recent in-sample learning approaches have further succeeded in entirely eschewing it. Nevertheless, we show that mild generalization beyond the dataset can be trusted and leveraged to improve performance under certain conditions. To appropriately exploit generalization in offline RL, we propose Doubly Mild Generalization (DMG), comprising (i) mild action generalization and (ii) mild generalization propagation. The former refers to selecting actions in a close neighborhood of the dataset to maximize the Q values. Even so, the potential erroneous generalization can still be propagated, accumulated, and exacerbated by bootstrapping. In light of this, the latter concept is introduced to mitigate the generalization propagation without impeding the propagation of RL learning signals. Theoretically, DMG guarantees better performance than the in-sample optimal policy in the oracle generalization scenario. Even under worst-case generalization, DMG can still control value overestimation at a certain level and lower bound the performance. Empirically, DMG achieves state-of-the-art performance across Gym-MuJoCo locomotion tasks and challenging AntMaze tasks. Moreover, benefiting from its flexibility in both generalization aspects, DMG enjoys a seamless transition from offline to online learning and attains strong online fine-tuning performance.
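The abstract's notion of "mild action generalization" — selecting actions in a close neighborhood of the dataset to maximize Q values — can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function names (`q_value`, `mild_action_generalization`) and the candidate-sampling scheme, the `epsilon` radius, and the toy quadratic Q function are all assumptions made for the example.

```python
import numpy as np

def q_value(state, action):
    # Stand-in for a learned Q network: a toy quadratic peaking at 0.3
    # in every action dimension (purely illustrative).
    return -np.sum((action - 0.3) ** 2)

def mild_action_generalization(state, dataset_action, epsilon=0.1,
                               n_candidates=64, seed=None):
    """Pick the Q-maximizing action within an epsilon-ball of a dataset action."""
    rng = np.random.default_rng(seed)
    # Sample candidate actions uniformly inside the epsilon-neighborhood,
    # always keeping the dataset action itself as a candidate.
    noise = rng.uniform(-epsilon, epsilon,
                        size=(n_candidates,) + dataset_action.shape)
    candidates = np.concatenate([dataset_action[None],
                                 dataset_action[None] + noise])
    scores = np.array([q_value(state, a) for a in candidates])
    return candidates[np.argmax(scores)]

state = np.zeros(3)
a_data = np.array([0.0, 0.0])
a_sel = mild_action_generalization(state, a_data, epsilon=0.1, seed=0)
# The selected action never strays more than epsilon from the dataset action,
# and its Q value is at least that of the dataset action.
```

Because the dataset action is always among the candidates, the selected action can only improve the Q value, while the epsilon bound keeps the selection "mild" — i.e., close to the data distribution.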

Yixiu Mao, Qi Wang, Yun Qu, Yuhang Jiang, Xiangyang Ji • 2024

Related benchmarks

Task                            Dataset                           Normalized Score  Rank
Offline Reinforcement Learning  D4RL halfcheetah-medium-expert    91.1              155
Offline Reinforcement Learning  D4RL hopper-medium-expert         110.4             153
Offline Reinforcement Learning  D4RL walker2d-medium-expert       114.4             124
Offline Reinforcement Learning  D4RL Medium-Replay Hopper         101.9             97
Offline Reinforcement Learning  D4RL Medium HalfCheetah           54.9              97
Offline Reinforcement Learning  D4RL Medium Walker2d              92.4              96
Offline Reinforcement Learning  D4RL walker2d-random              4.8               93
Offline Reinforcement Learning  D4RL halfcheetah-random           28.8              86
Offline Reinforcement Learning  D4RL Medium-Replay HalfCheetah    51.4              84
Offline Reinforcement Learning  D4RL hopper-random                20.4              78

(Showing 10 of 41 rows.)
