Adaptive $Q$-Aid for Conditional Supervised Learning in Offline Reinforcement Learning
About
Offline reinforcement learning (RL) has progressed with return-conditioned supervised learning (RCSL), but its lack of stitching ability remains a limitation. We introduce $Q$-Aided Conditional Supervised Learning (QCS), which effectively combines the stability of RCSL with the stitching capability of $Q$-functions. By analyzing $Q$-function over-generalization, which impairs stable stitching, QCS adaptively integrates $Q$-aid into RCSL's loss function based on trajectory return. Empirical results show that QCS significantly outperforms RCSL and value-based methods, consistently achieving or exceeding the maximum trajectory returns across diverse offline RL benchmarks.
Jeonghye Kim, Suyoung Lee, Woojun Kim, Youngchul Sung • 2024
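To make the idea of return-adaptive $Q$-aid concrete, below is a minimal PyTorch-style sketch of how a $Q$-maximization term could be folded into an RCSL loss with a weight that shrinks as a trajectory's return approaches the dataset maximum. The function name, arguments, and the linear weighting rule are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import torch

def qcs_loss(policy_actions, dataset_actions, q_values, traj_return,
             return_min, return_max, base_weight=1.0):
    """Sketch of a Q-aided conditional supervised loss.

    Combines an RCSL behavior-cloning term with a Q-maximization term
    whose weight decreases as the trajectory return approaches the
    dataset maximum: near-optimal trajectories need little stitching
    aid, while suboptimal ones benefit from the Q-function's guidance.
    The linear weighting rule here is an assumption for illustration.
    """
    # RCSL term: imitate dataset actions conditioned on return-to-go.
    rcsl_loss = torch.mean((policy_actions - dataset_actions) ** 2)

    # Adaptive Q-aid weight: larger for lower-return trajectories.
    norm_return = (traj_return - return_min) / (return_max - return_min + 1e-8)
    q_weight = base_weight * (1.0 - norm_return)

    # Q-aid term: push the policy toward actions with high estimated Q-values.
    q_aid_loss = -torch.mean(q_values)

    return rcsl_loss + q_weight * q_aid_loss
```

In this sketch, a trajectory at the dataset's maximum return receives zero $Q$-aid (pure RCSL), while the lowest-return trajectory receives the full `base_weight`, matching the paper's motivation that $Q$-function over-generalization hurts most on high-return data.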
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | antmaze medium-play | Score | 84.8 | 35 |
| Offline Reinforcement Learning | D4RL Adroit pen (human) | Normalized Return | 83.9 | 32 |
| Offline Reinforcement Learning | D4RL Adroit pen (cloned) | Normalized Return | 66.5 | 32 |
| Offline Reinforcement Learning | MuJoCo hopper D4RL (medium-replay) | Normalized Return | 100.4 | 26 |
| Offline Reinforcement Learning | MuJoCo walker2d-medium D4RL | Normalized Return | 88.2 | 20 |
| Offline Reinforcement Learning | MuJoCo halfcheetah-medium-replay D4RL | Normalized Return | 54.1 | 20 |
| Offline Reinforcement Learning | MuJoCo walker2d medium-replay D4RL | Normalized Return | 94.1 | 20 |
| Offline Reinforcement Learning | MuJoCo halfcheetah-medium D4RL | Normalized Return | 59 | 20 |
| Offline Reinforcement Learning | MuJoCo walker2d medium-expert D4RL | Normalized Return | 116.6 | 18 |
| Offline Reinforcement Learning | MuJoCo halfcheetah-medium-expert D4RL | Normalized Return | 93.3 | 18 |
*(10 of 17 benchmark rows shown.)*