Efficient Cross-Domain Offline Reinforcement Learning with Dynamics- and Value-Aligned Data Filtering
About
Cross-domain offline reinforcement learning (RL) aims to train a well-performing agent in the target environment by leveraging both a limited target domain dataset and a source domain dataset with (possibly) sufficient data coverage. Due to the underlying dynamics misalignment between the source and target domains, naively merging the two datasets may degrade performance. Recent advances address this issue by selectively leveraging source domain samples whose dynamics align well with the target domain. However, our work demonstrates that dynamics alignment alone is insufficient, by examining the limitations of prior frameworks and deriving a new target domain sub-optimality bound for the policy learned on the source domain. More importantly, our theory underscores an additional need for *value alignment*, i.e., selecting high-quality, high-value samples from the source domain, a critical dimension overlooked by existing works. Motivated by this theoretical insight, we propose the **D**ynamics- and **V**alue-aligned **D**ata **F**iltering (DVDF) method, a novel unified cross-domain RL framework that selectively incorporates source domain samples exhibiting strong alignment in *both dynamics and values*. We empirically study a range of dynamics-shift scenarios, including kinematic and morphology shifts, and evaluate DVDF on various tasks and datasets, even in the challenging setting where the target domain dataset contains an extremely limited amount of data. Extensive experiments demonstrate that DVDF consistently outperforms strong baselines with significant improvements.
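The two-stage filtering idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `filter_source_samples`, the per-sample `dynamics_gap` scores (e.g., a discrepancy under a learned target dynamics model), and the `value_est` estimates (e.g., from a learned value function) are all assumed inputs, and the specific threshold/quantile rules are placeholders for whatever criteria DVDF actually uses.

```python
import numpy as np

def filter_source_samples(dynamics_gap, value_est,
                          dyn_threshold, value_quantile=0.5):
    """Hypothetical sketch of dynamics- and value-aligned filtering.

    dynamics_gap : per-sample dynamics discrepancy between source and
                   target domains (lower = better aligned), shape (N,)
    value_est    : per-sample value estimate (higher = better), shape (N,)
    Returns a boolean mask over the source batch.
    """
    # Stage 1 (dynamics alignment): keep samples whose dynamics
    # discrepancy with the target domain is below a threshold.
    dyn_mask = dynamics_gap < dyn_threshold
    if not dyn_mask.any():
        return np.zeros_like(dyn_mask)

    # Stage 2 (value alignment): among dynamics-aligned samples, keep
    # only those whose estimated value reaches a quantile cutoff,
    # so the retained data is both transferable and high-quality.
    cutoff = np.quantile(value_est[dyn_mask], value_quantile)
    return dyn_mask & (value_est >= cutoff)
```

The design point the abstract argues for is visible here: stage 1 alone (the prior approach) would admit well-aligned but low-value transitions; the stage-2 cutoff discards those as well.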
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | hopper medium | Normalized Score | 20.3 | 52 |
| Offline Reinforcement Learning | walker2d medium | Normalized Score | 24.3 | 51 |
| Offline Reinforcement Learning | walker2d medium-replay | Normalized Score | 4.8 | 50 |
| Offline Reinforcement Learning | hopper medium-replay | Normalized Score | 7.4 | 44 |
| Offline Reinforcement Learning | halfcheetah medium-replay | Normalized Score | 25.1 | 43 |
| Offline Reinforcement Learning | halfcheetah medium | Normalized Score | 26.7 | 43 |
| Offline Reinforcement Learning | walker2d medium-expert | Normalized Score | 23 | 31 |
| Offline Reinforcement Learning | hopper medium-expert | Normalized Score | 43.2 | 24 |
| Offline Reinforcement Learning | hopper expert | Normalized Score | 48.9 | 19 |
| Offline Reinforcement Learning | halfcheetah medium-expert | Normalized Score | 21.9 | 15 |