Adaptive $Q$-Aid for Conditional Supervised Learning in Offline Reinforcement Learning
About
Offline reinforcement learning (RL) has progressed with return-conditioned supervised learning (RCSL), but RCSL's lack of stitching ability (combining sub-trajectories from different episodes into a better one) remains a limitation. We introduce $Q$-Aided Conditional Supervised Learning (QCS), which combines the stability of RCSL with the stitching capability of $Q$-functions. Guided by an analysis of $Q$-function over-generalization, which impairs stable stitching, QCS adaptively integrates $Q$-aid into RCSL's loss function according to the trajectory return. Empirical results show that QCS significantly outperforms both RCSL and value-based methods, consistently matching or exceeding the maximum trajectory returns in the datasets across diverse offline RL benchmarks.
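The abstract does not give the exact loss or weighting schedule, so the following is only a minimal plain-Python sketch of the idea under stated assumptions. It assumes the RCSL term is a mean-squared action-regression error, that the $Q$-aid is subtracted as a weighted $Q$-value of the policy's predicted action (in the style of TD3+BC's actor loss), and that the weight shrinks linearly as the trajectory return grows, so high-return trajectories rely mostly on RCSL. The names `trajectory_return`, `qaid_weight`, `qcs_loss`, `lam`, `r_min`, and `r_max` are hypothetical. It computes loss values only; a real implementation would use an autodiff framework and backpropagate through the policy.

```python
from typing import Sequence


def trajectory_return(rewards: Sequence[float]) -> float:
    """Undiscounted sum of rewards over one trajectory, R(tau)."""
    return float(sum(rewards))


def qaid_weight(traj_return: float, r_min: float, r_max: float,
                lam: float = 1.0) -> float:
    """Hypothetical adaptive Q-aid weight (an assumption, not the paper's formula).

    Returns `lam` for the lowest-return trajectory and 0 for the highest,
    decreasing linearly in between and clipped to [0, lam].
    """
    if r_max <= r_min:
        return 0.0
    frac = (r_max - traj_return) / (r_max - r_min)
    frac = min(max(frac, 0.0), 1.0)
    return lam * frac


def qcs_loss(pred_actions: Sequence[Sequence[float]],
             data_actions: Sequence[Sequence[float]],
             q_of_pred: Sequence[float],
             weights: Sequence[float]) -> float:
    """Batch-mean of (action MSE - weight * Q(s, predicted action)).

    The MSE term plays the role of the RCSL loss; the subtracted term is the
    assumed Q-aid, scaled per sample by its trajectory's weight.
    """
    total = 0.0
    for a_hat, a, q, w in zip(pred_actions, data_actions, q_of_pred, weights):
        mse = sum((x - y) ** 2 for x, y in zip(a_hat, a)) / len(a)
        total += mse - w * q
    return total / len(pred_actions)


if __name__ == "__main__":
    returns = [trajectory_return(r) for r in ([1.0, 1.0, 1.0], [0.0, 0.5, 0.5])]
    lo, hi = min(returns), max(returns)
    ws = [qaid_weight(r, lo, hi, lam=0.5) for r in returns]
    print(ws)
    print(qcs_loss([[0.0, 1.0], [0.5, 0.5]], [[0.0, 0.0], [0.5, 0.5]],
                   [2.0, 4.0], ws))
```

With these toy inputs the higher-return trajectory gets weight 0 (pure action regression) and the lower-return one gets the full `lam`, which is the qualitative behavior the assumed schedule is meant to illustrate.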
Jeonghye Kim, Suyoung Lee, Woojun Kim, Youngchul Sung • 2024
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | hopper medium | Normalized Score | 96.4 | 68 |
| Offline Reinforcement Learning | walker2d medium-replay | Normalized Score | 94.1 | 61 |
| Offline Reinforcement Learning | walker2d medium | Normalized Score | 88.2 | 61 |
| Offline Reinforcement Learning | hopper medium-replay | Normalized Score | 100.4 | 55 |
| Offline Reinforcement Learning | halfcheetah medium-replay | Normalized Score | 54.1 | 54 |
| Offline Reinforcement Learning | halfcheetah medium | Normalized Score | 59.0 | 53 |
| Offline Reinforcement Learning | D4RL Adroit pen (human) | Normalized Return | 83.9 | 53 |
| Offline Reinforcement Learning | D4RL Adroit pen (cloned) | Normalized Return | 66.5 | 53 |
| Offline Reinforcement Learning | antmaze medium-play | Score | 84.8 | 44 |
| Offline Reinforcement Learning | walker2d medium-expert | Normalized Score | 116.6 | 42 |
10 of 39 benchmark results shown.
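Most results above are D4RL-style normalized scores, which rescale a raw episode return so that 0 corresponds to a random policy and 100 to an expert reference policy; values above 100 (for example 116.6 on walker2d medium-expert) therefore exceed the expert reference. The sketch below shows only that rescaling. The reference returns in the usage example are made-up placeholders, not D4RL's actual per-environment constants.

```python
def normalized_score(raw_return: float, random_return: float,
                     expert_return: float) -> float:
    """Rescale a raw return so random -> 0 and expert -> 100."""
    if expert_return == random_return:
        raise ValueError("expert and random reference returns must differ")
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


if __name__ == "__main__":
    # Placeholder reference returns, chosen only for illustration.
    print(normalized_score(1500.0, 0.0, 3000.0))
    print(normalized_score(3300.0, 0.0, 3000.0))
```

Here a raw return halfway between the placeholder random and expert returns maps to 50, and one 10% above the expert reference maps to 110.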