Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics
About
Cross-domain reinforcement learning (CDRL) aims to improve the data efficiency of RL by leveraging data collected in a source domain to facilitate learning in a similar target domain. Despite its potential, cross-domain transfer in RL faces two fundamental and intertwined challenges: (i) the source and target domains can have distinct state or action spaces, which makes direct transfer infeasible and requires more sophisticated inter-domain mappings; (ii) the transferability of a source-domain model is not easily identifiable a priori, so CDRL can be prone to negative transfer. In this paper, we propose to jointly tackle these two challenges through the lens of cross-domain Bellman consistency and a hybrid critic. Specifically, we first introduce the notion of cross-domain Bellman consistency as a way to measure the transferability of a source-domain model. Then, we propose $Q$Avatar, which combines the Q functions from the source and target domains with an adaptive, hyperparameter-free weight function. Through this design, we characterize the convergence behavior of $Q$Avatar and show that it achieves reliable transfer in the sense that it effectively leverages a source-domain Q function for knowledge transfer to the target domain. Through experiments, we demonstrate that $Q$Avatar achieves favorable transferability across various RL benchmark tasks, including locomotion and robot arm manipulation. Our code is available at https://rl-bandits-lab.github.io/Cross-Domain-RL/.
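To make the hybrid-critic idea in the abstract concrete, the following is a minimal tabular sketch, not the paper's implementation. It assumes hypothetical inter-domain mappings `phi` (target state to source state) and `psi` (target action to source action), measures each critic's one-step Bellman residual on target-domain transitions, and mixes the two critics with a weight that favors the one with the smaller residual. The specific weight formula, the `eps` numerical guard, and all function names are illustrative assumptions; the abstract does not state the actual weight function.

```python
from typing import Callable, Sequence, Tuple

# (state, action, reward, next_state) collected in the TARGET domain.
Transition = Tuple[int, int, float, int]
QFunc = Callable[[int, int], float]


def map_source_q(q_src: QFunc,
                 phi: Callable[[int], int],
                 psi: Callable[[int], int]) -> QFunc:
    """Express a source-domain Q function in target-domain coordinates.

    phi and psi are hypothetical state/action mappings (assumed given).
    """
    return lambda s, a: q_src(phi(s), psi(a))


def bellman_error(q: QFunc,
                  transitions: Sequence[Transition],
                  n_actions: int,
                  gamma: float = 0.99) -> float:
    """Mean squared one-step Bellman residual of q on target transitions."""
    total = 0.0
    for s, a, r, s_next in transitions:
        target = r + gamma * max(q(s_next, b) for b in range(n_actions))
        total += (q(s, a) - target) ** 2
    return total / len(transitions)


def source_weight(err_src: float, err_tgt: float, eps: float = 1e-8) -> float:
    """Weight on the source critic (assumed form, not the paper's formula).

    Approaches 1 when the mapped source critic is far more Bellman-consistent
    than the target critic, and 0 in the opposite case.
    """
    return (err_tgt + eps) / (err_src + err_tgt + 2.0 * eps)


def hybrid_q(q_src_mapped: QFunc, q_tgt: QFunc, alpha: float) -> QFunc:
    """Convex combination of the mapped source critic and the target critic."""
    return lambda s, a: alpha * q_src_mapped(s, a) + (1.0 - alpha) * q_tgt(s, a)


if __name__ == "__main__":
    # Toy data: one state, one action, reward 1, gamma = 0 for simplicity.
    data = [(0, 0, 1.0, 0)]
    q_tgt = lambda s, a: 0.0          # untrained target critic
    q_src = lambda s, a: 1.0          # source critic that happens to fit
    q_src_m = map_source_q(q_src, phi=lambda s: s, psi=lambda a: a)

    e_src = bellman_error(q_src_m, data, n_actions=1, gamma=0.0)
    e_tgt = bellman_error(q_tgt, data, n_actions=1, gamma=0.0)
    alpha = source_weight(e_src, e_tgt)
    print(alpha, hybrid_q(q_src_m, q_tgt, alpha)(0, 0))
```

In this sketch, a source critic that is inconsistent with target-domain dynamics receives a small weight, which is one simple way to limit negative transfer; the weighting actually used by $Q$Avatar may differ.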
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Continuous Control | MuJoCo Ant | Average Reward | 2.86e+3 | 26 |
| Continuous Control | MuJoCo HalfCheetah | Average Reward | 1.16e+4 | 25 |
| Continuous Control | Robosuite Door Opening | Final Reward | 216.6 | 7 |
| Continuous Control | Robosuite Table Wiping | Final Reward | 76.6 | 7 |
| Continuous Control | Safety-Gym Navigation | Final Reward | 38.5 | 7 |
| Continuous Control | Halfcheetah | Steps to Threshold | 1.26e+5 | 2 |
| Navigation | Navigation | Steps to Threshold | 2.18e+5 | 2 |
| Robotic Manipulation | Door Opening | Environment Steps | 48 | 2 |
| Robotic Manipulation | Table Wiping | Environment Steps to Threshold | 7.20e+4 | 2 |