Efficient and Uncertainty-Aware Diffusion Framework for Offline-to-Online Reinforcement Learning
About
Offline-to-Online Reinforcement Learning (O2O-RL) leverages an offline, pre-trained policy to minimize costly online interactions. Although data-efficient, O2O-RL is susceptible to shifts between the offline and online distributions. Existing work aims to mitigate the harm of this shift by fine-tuning the policy on trajectory data sampled from a diffusion model. Inspired by this line of work, we propose DUAL: an efficient **D**iffusion **U**ncertainty-**A**ware framework for offline-to-online reinforcement **L**earning. In the offline phase, DUAL uses the prior knowledge of the diffusion model to distill a fast-sampling diffusion actor policy and a transition model. In the online phase, DUAL employs a Laplace approximation and distance-based transition-state-shift detection, using uncertainty quantification to better balance exploration and exploitation. We formally show that our actor loss with the Laplace approximation provides a proxy for a principled estimate of epistemic uncertainty. Empirically, DUAL improves the online expected return over O2O-RL baselines across multiple settings and environments.
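To illustrate how a Laplace approximation can yield an epistemic uncertainty estimate, the following is a minimal sketch on a toy linear model, not DUAL's actor or transition model. It assumes a model y = w0 + w1 * x with Gaussian noise and a Gaussian prior on the weights; the function names, `noise_var`, and `prior_precision` are hypothetical choices made for this illustration. In this Gaussian-linear case the Laplace approximation coincides with the exact posterior, which keeps the example easy to check by hand.

```python
def laplace_precision(xs, noise_var=1.0, prior_precision=1.0):
    """Hessian of the negative log posterior over weights (w0, w1).

    For features phi(x) = [1, x], the Hessian is
    prior_precision * I + (1 / noise_var) * sum(phi(x) phi(x)^T).
    Returned as the three unique entries of the symmetric 2x2 matrix.
    """
    h00 = prior_precision + len(xs) / noise_var
    h01 = sum(xs) / noise_var
    h11 = prior_precision + sum(x * x for x in xs) / noise_var
    return h00, h01, h11


def epistemic_variance(x, precision):
    """Predictive variance phi(x)^T H^{-1} phi(x) caused by weight uncertainty."""
    h00, h01, h11 = precision
    det = h00 * h11 - h01 * h01
    # Closed-form inverse of the symmetric 2x2 Hessian (posterior covariance).
    s00, s01, s11 = h11 / det, -h01 / det, h00 / det
    return s00 + 2.0 * x * s01 + x * x * s11


# Toy "offline" inputs clustered around zero (illustrative values only).
offline_inputs = [-1.0, -0.5, 0.0, 0.5, 1.0]
precision = laplace_precision(offline_inputs)

var_near = epistemic_variance(0.0, precision)  # query inside the offline data range
var_far = epistemic_variance(5.0, precision)   # query far from the offline data
print(f"near: {var_near:.3f}, far: {var_far:.3f}")
```

The variance grows as the query moves away from the offline inputs, which is the kind of signal an agent could use to decide where additional online exploration is informative.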
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Locomotion | D4RL Hopper-medium-replay v2 | Online Normalized Return | 109.2 | 12 |
| Locomotion | D4RL walker2d medium-replay v2 | Online Normalized Return | 93.15 | 12 |
| Locomotion | D4RL Hopper Medium v2 | Online Normalized Return | 88.23 | 12 |
| Locomotion | D4RL HalfCheetah-medium-replay v2 | Online Normalized Return | 48.78 | 12 |
| Locomotion | D4RL HalfCheetah Medium v2 | Online Return (Normalized) | 49.83 | 12 |
| Offline-to-Online Reinforcement Learning | pen-cloned v1 | Avg Online Return | 124.4 | 8 |
| Offline-to-Online Reinforcement Learning | door-cloned v1 | Average Online Return | 15.26 | 8 |
| Offline-to-Online Reinforcement Learning | hammer-cloned v1 | Average Online Expected Return | 46.74 | 8 |
| Offline-to-Online Reinforcement Learning | relocate cloned v1 | Average Online Expected Return | 0.44 | 8 |
| Offline-to-Online Reinforcement Learning | Adroit Average | Average Online Return | 46.71 | 8 |
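For reading the normalized returns above, the sketch below shows the D4RL-style normalization, which rescales a raw episode return so that a random policy scores 0 and an expert policy scores 100. It is an assumption that the reported values follow this convention. The reference returns in the usage lines are placeholder numbers, not the per-environment constants that D4RL provides.

```python
def normalized_return(raw_return, random_return, expert_return):
    """Rescale a raw return so that random -> 0 and expert -> 100."""
    if expert_return == random_return:
        raise ValueError("expert and random reference returns must differ")
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Placeholder reference returns (hypothetical, for illustration only).
RANDOM_RETURN = 0.0
EXPERT_RETURN = 200.0

score = normalized_return(100.0, RANDOM_RETURN, EXPERT_RETURN)
print(score)
```

Under this convention, a score above 100 means the policy's return exceeds that of the expert reference policy.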