Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts
About
Single-domain offline reinforcement learning (RL) often suffers from limited data coverage, while cross-domain offline RL mitigates this issue by leveraging additional data from other domains with dynamics shifts. However, existing studies primarily focus on train-time robustness (handling dynamics shifts from training data), neglecting test-time robustness against dynamics perturbations when the policy is deployed in practical scenarios. In this paper, we investigate dual (both train-time and test-time) robustness against dynamics shifts in cross-domain offline RL. We first empirically show that a policy trained with cross-domain offline RL is fragile under dynamics perturbations during evaluation, particularly when target-domain data is limited. To address this, we introduce a novel robust cross-domain Bellman (RCB) operator, which enhances test-time robustness against dynamics perturbations while staying conservative toward out-of-distribution dynamics transitions, thus guaranteeing train-time robustness. To further counteract potential value overestimation or underestimation caused by the RCB operator, we incorporate two techniques, a dynamic value penalty and the Huber loss, into our framework, resulting in the practical **D**ual-**RO**bust **C**ross-domain **O**ffline RL (DROCO) algorithm. Extensive empirical results across various dynamics-shift scenarios show that DROCO outperforms strong baselines and exhibits enhanced robustness to dynamics perturbations.
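To make the two stabilizing ingredients concrete, the sketch below illustrates (a) a Huber loss for TD regression and (b) a pessimistic Bellman target that takes the worst case over value estimates under sampled dynamics perturbations and subtracts a value penalty. This is a minimal illustration under our own assumptions (the function names, the fixed `penalty_coef`, and the min-over-perturbations form are placeholders), not the exact DROCO operator from the paper.

```python
import numpy as np

def huber_loss(td_error, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails,
    which damps the effect of outlier TD errors on value regression."""
    abs_err = np.abs(td_error)
    quadratic = np.minimum(abs_err, delta)
    linear = abs_err - quadratic
    return 0.5 * quadratic ** 2 + delta * linear

def robust_bellman_target(reward, next_values, penalty_coef, gamma=0.99):
    """Illustrative robust backup: pessimistic minimum over candidate
    next-state values (one per sampled dynamics perturbation), with a
    scalar value penalty standing in for the dynamic penalty term."""
    worst_case = np.min(next_values, axis=-1)  # pessimism over perturbed dynamics
    return reward + gamma * (worst_case - penalty_coef)
```

In practice the penalty coefficient would be adapted to how far a transition lies from the target-domain data, rather than held fixed as here.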
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | Normalized Score | 70.1 | 117 |
| Offline Reinforcement Learning | D4RL hopper-medium-expert | Normalized Score | 82.3 | 115 |
| Offline Reinforcement Learning | D4RL hopper-medium-replay | Normalized Score | 51.6 | 72 |
| Offline Reinforcement Learning | D4RL walker2d-medium v2 | Normalized Return | 70.8 | 67 |
| Offline Reinforcement Learning | D4RL halfcheetah-medium | Normalized Score | 45.8 | 59 |
| Offline Reinforcement Learning | D4RL halfcheetah-medium-replay | Normalized Score | 27.9 | 59 |
| Offline Reinforcement Learning | D4RL walker2d-medium | Normalized Score | 60.1 | 58 |
| Offline Reinforcement Learning | D4RL halfcheetah-medium-replay v2 | Normalized Score | 26.9 | 58 |
| Offline Reinforcement Learning | D4RL walker2d-expert v2 | Normalized Score | 106.0 | 56 |
| Offline Reinforcement Learning | D4RL hopper-expert v2 | Normalized Score | 89.3 | 56 |
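The scores in the table follow the standard D4RL normalization, which rescales a raw episode return so that 0 corresponds to a random policy and 100 to an expert policy. A minimal sketch (the reference returns passed in here are illustrative placeholders, not the official D4RL constants):

```python
def d4rl_normalized_score(raw_return, random_return, expert_return):
    """D4RL normalized score:
    100 * (raw - random) / (expert - random),
    so a random policy scores 0 and an expert policy scores 100."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)

# Illustrative reference returns (placeholders, not official D4RL values):
print(d4rl_normalized_score(raw_return=50.0, random_return=0.0, expert_return=100.0))
```

Scores above 100, such as walker2d-expert above, simply mean the policy's raw return exceeds the expert reference return.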