Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

About

Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their policies. In this paper, we propose Q-ALIGN DT, a framework that enforces this alignment by ensuring the $Q$-value of the output policy is consistent with the input RTG. By leveraging a $Q$ function to provide dense guidance to CSMs and further fine-tuning it using an RTG-perturbation technique with the CSM, our method ensures that higher RTGs are consistently mapped to trajectories with higher expected returns. Theoretically, we show that Q-ALIGN DT can efficiently learn the desired policy and output a near-optimal one when the RTG is sufficiently high. Empirically, we demonstrate through extensive experiments that Q-ALIGN DT achieves superior controllability and performance across the D4RL benchmark. Remarkably, our model effectively learns a structured family of policies that maintains precise alignment and generalizes to tasks like velocity-tracking where prior methods fail.
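As a rough illustration of what "alignment" between the input RTG and the policy's $Q$-value could mean, the sketch below measures the mean squared gap between the conditioning RTG and the $Q$-value of the action the policy returns. It is not the paper's training objective. The names `alignment_gap`, `toy_q`, `aligned_policy`, and `rtg_blind_policy`, as well as the one-step toy problem, are hypothetical and chosen only for this example.

```python
from typing import Callable, Sequence


def alignment_gap(
    policy: Callable[[float, float], float],
    q_fn: Callable[[float, float], float],
    states: Sequence[float],
    rtgs: Sequence[float],
) -> float:
    """Mean squared gap between Q(s, policy(s, g)) and the conditioning RTG g."""
    total = 0.0
    count = 0
    for s in states:
        for g in rtgs:
            a = policy(s, g)                 # action chosen when conditioned on RTG g
            total += (q_fn(s, a) - g) ** 2   # how far the achieved value is from g
            count += 1
    return total / count


# Hypothetical one-step toy problem: Q(s, a) = 10 * a for actions a in [0, 1].
def toy_q(s: float, a: float) -> float:
    return 10.0 * a


# A policy that uses the RTG: picks the action whose Q-value matches g.
def aligned_policy(s: float, g: float) -> float:
    return min(max(g / 10.0, 0.0), 1.0)


# A policy that ignores the RTG and always plays the same action.
def rtg_blind_policy(s: float, g: float) -> float:
    return 0.5


states = [0.0, 1.0]
rtgs = [2.0, 5.0, 8.0]
gap_aligned = alignment_gap(aligned_policy, toy_q, states, rtgs)
gap_blind = alignment_gap(rtg_blind_policy, toy_q, states, rtgs)
print(gap_aligned, gap_blind)
```

In this toy setting the RTG-aware policy has a smaller gap than the RTG-blind one, which is the kind of consistency the abstract describes: a higher requested RTG should map to behavior with a correspondingly higher expected return.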

Yuxiao Yang, Weitong Zhang • 2026

Related benchmarks

Task	Dataset	Result
Offline Reinforcement Learning	hopper medium	Normalized Score 102.1	68
Offline Reinforcement Learning	walker2d medium-replay	Normalized Score 101.3	61
Offline Reinforcement Learning	walker2d medium	Normalized Score 94.7	61
Offline Reinforcement Learning	hopper medium-replay	Normalized Score 102.2	55
Offline Reinforcement Learning	halfcheetah medium-replay	Normalized Score 57.1	54
Offline Reinforcement Learning	halfcheetah medium	Normalized Score 65.3	53
Offline Reinforcement Learning	antmaze medium-play	Score 85.6	44
Offline Reinforcement Learning	Walker2d medium-expert	Normalized Score 121.4	42
Offline Reinforcement Learning	HalfCheetah Vel	Maximum episode return -1.20e+3	40
Offline Reinforcement Learning	Hopper medium-expert	Normalized Score 114	35

Showing 10 of 24 rows
