
Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

About

Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their policies. In this paper, we propose Q-ALIGN DT, a framework that enforces this alignment by ensuring the $Q$-value of the output policy is consistent with the input RTG. By leveraging a $Q$ function to provide dense guidance to CSMs and further fine-tuning it using an RTG-perturbation technique with the CSM, our method ensures that higher RTGs are consistently mapped to trajectories with higher expected returns. Theoretically, we show that Q-ALIGN DT can efficiently learn the desired policy and output a near-optimal one when the RTG is sufficiently high. Empirically, we demonstrate through extensive experiments that Q-ALIGN DT achieves superior controllability and performance across the D4RL benchmark. Remarkably, our model effectively learns a structured family of policies that maintains precise alignment and generalizes to tasks like velocity-tracking where prior methods fail.
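
The alignment idea described above, that the $Q$-value of the action produced for a given RTG should match that RTG, can be illustrated with a small sketch. This is only one possible reading of the abstract, not the paper's actual objective: the squared-error form, the function name `alignment_loss`, and the toy `toy_q`, `aligned_policy`, and `ignoring_policy` helpers are all hypothetical and chosen purely for illustration.

```python
from typing import Callable, Sequence

State = Sequence[float]
Action = Sequence[float]


def alignment_loss(
    states: Sequence[State],
    rtgs: Sequence[float],
    policy: Callable[[State, float], Action],
    q_fn: Callable[[State, Action], float],
) -> float:
    """Mean squared gap between Q(s, policy(s, g)) and the conditioning RTG g.

    Assumed form (not taken from the paper): a policy is 'aligned' when the
    value of the action it outputs for RTG g equals g.
    """
    if len(states) != len(rtgs) or len(states) == 0:
        raise ValueError("states and rtgs must be non-empty and equal length")
    total = 0.0
    for s, g in zip(states, rtgs):
        a = policy(s, g)                  # action conditioned on the RTG
        total += (q_fn(s, a) - g) ** 2    # penalize Q/RTG mismatch
    return total / len(states)


# Toy 1-D problem (hypothetical): Q(s, a) = s + a.
def toy_q(s: State, a: Action) -> float:
    return s[0] + a[0]


# A policy that picks the action whose Q-value equals the requested RTG.
def aligned_policy(s: State, g: float) -> Action:
    return [g - s[0]]


# A policy that treats the RTG as "just a number" and ignores it.
def ignoring_policy(s: State, g: float) -> Action:
    return [0.0]


states = [[1.0], [2.0]]
rtgs = [3.0, 5.0]
aligned = alignment_loss(states, rtgs, aligned_policy, toy_q)
ignored = alignment_loss(states, rtgs, ignoring_policy, toy_q)
```

In this toy setting the aligned policy has zero loss, while the policy that ignores its RTG input is penalized, which is the kind of mismatch the abstract says existing CSMs leave unaddressed.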

Yuxiao Yang, Weitong Zhang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Offline Reinforcement Learning | hopper medium | Normalized Score | 102.1 | 68 |
| Offline Reinforcement Learning | walker2d medium-replay | Normalized Score | 101.3 | 61 |
| Offline Reinforcement Learning | walker2d medium | Normalized Score | 94.7 | 61 |
| Offline Reinforcement Learning | hopper medium-replay | Normalized Score | 102.2 | 55 |
| Offline Reinforcement Learning | halfcheetah medium-replay | Normalized Score | 57.1 | 54 |
| Offline Reinforcement Learning | halfcheetah medium | Normalized Score | 65.3 | 53 |
| Offline Reinforcement Learning | antmaze medium-play | Score | 85.6 | 44 |
| Offline Reinforcement Learning | Walker2d medium-expert | Normalized Score | 121.4 | 42 |
| Offline Reinforcement Learning | HalfCheetah Vel | Maximum episode return | -1.20e+3 | 40 |
| Offline Reinforcement Learning | Hopper medium-expert | Normalized Score | 114 | 35 |
Showing 10 of 24 rows
