Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning
About
Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat RTGs as plain numerical inputs without aligning them with the performance of the resulting policies. In this paper, we propose Q-ALIGN DT, a framework that enforces this alignment by ensuring the $Q$-value of the output policy is consistent with the input RTG. By leveraging a $Q$ function to provide dense guidance to CSMs and further fine-tuning it together with the CSM using an RTG-perturbation technique, our method ensures that higher RTGs are consistently mapped to trajectories with higher expected returns. Theoretically, we show that Q-ALIGN DT can efficiently learn the desired policy and output a near-optimal one when the RTG is sufficiently high. Empirically, we demonstrate through extensive experiments that Q-ALIGN DT achieves superior controllability and performance across the D4RL benchmark. Remarkably, our model effectively learns a structured family of policies that maintains precise alignment and generalizes to tasks like velocity-tracking where prior methods fail.
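To make the alignment idea concrete, the sketch below computes undiscounted returns-to-go from a reward sequence and then measures how far a critic's value for the policy's action is from the RTG the policy was conditioned on. This is only an illustration under stated assumptions: the abstract does not give the actual objective, so the squared-gap measure, the function names `returns_to_go` and `alignment_gap`, and the toy `q_fn` and policies are all hypothetical and are not the authors' implementation.

```python
from typing import Callable, List, Sequence


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Undiscounted return-to-go: rtg[t] is the sum of rewards from step t onward."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards so each entry is a suffix sum.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def alignment_gap(
    q_fn: Callable[[float, float], float],
    policy: Callable[[float, float], float],
    states: Sequence[float],
    target_rtgs: Sequence[float],
) -> float:
    """Mean squared gap between Q(s, policy(s, rtg)) and the conditioning rtg.

    Hypothetical measure for illustration only; the paper's loss may differ.
    """
    gaps = [
        (q_fn(s, policy(s, g)) - g) ** 2
        for s, g in zip(states, target_rtgs)
    ]
    return sum(gaps) / len(gaps)


# Toy setup (assumption): scalar states and actions, and a critic whose
# value equals the action taken.
toy_q = lambda s, a: a
aligned_policy = lambda s, g: g        # action tracks the requested RTG
unaligned_policy = lambda s, g: 1.0    # action ignores the requested RTG

states = [0.0, 0.0]
rtgs = [1.0, 3.0]
print(returns_to_go([1.0, 2.0, 3.0]))
print(alignment_gap(toy_q, aligned_policy, states, rtgs))
print(alignment_gap(toy_q, unaligned_policy, states, rtgs))
```

In this toy setting the policy that tracks the requested RTG has zero gap, while the policy that ignores it does not, which is the kind of mismatch the abstract says Q-ALIGN DT is designed to remove.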
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | hopper medium | Normalized Score | 102.1 | 68 |
| Offline Reinforcement Learning | walker2d medium-replay | Normalized Score | 101.3 | 61 |
| Offline Reinforcement Learning | walker2d medium | Normalized Score | 94.7 | 61 |
| Offline Reinforcement Learning | hopper medium-replay | Normalized Score | 102.2 | 55 |
| Offline Reinforcement Learning | halfcheetah medium-replay | Normalized Score | 57.1 | 54 |
| Offline Reinforcement Learning | halfcheetah medium | Normalized Score | 65.3 | 53 |
| Offline Reinforcement Learning | antmaze medium-play | Score | 85.6 | 44 |
| Offline Reinforcement Learning | Walker2d medium-expert | Normalized Score | 121.4 | 42 |
| Offline Reinforcement Learning | HalfCheetah Vel | Maximum episode return | -1.20e+3 | 40 |
| Offline Reinforcement Learning | Hopper medium-expert | Normalized Score | 114 | 35 |