
Peng's Q($\lambda$) for Conservative Value Estimation in Offline Reinforcement Learning

About

We propose a model-free offline multi-step reinforcement learning (RL) algorithm, Conservative Peng's Q($\lambda$) (CPQL). Our algorithm adapts the Peng's Q($\lambda$) (PQL) operator for conservative value estimation as an alternative to the Bellman operator. To the best of our knowledge, this is the first work in offline RL to theoretically and empirically demonstrate the effectiveness of conservative value estimation with a multi-step operator by fully leveraging offline trajectories. The fixed point of the PQL operator in offline RL lies closer to the value function of the behavior policy, thereby naturally inducing implicit behavior regularization. CPQL simultaneously mitigates over-pessimistic value estimation, achieves performance greater than (or equal to) that of the behavior policy, and provides near-optimal performance guarantees, a milestone that previous conservative approaches could not achieve. Extensive numerical experiments on the D4RL benchmark demonstrate that CPQL consistently and significantly outperforms existing offline single-step baselines. In addition to the contributions of CPQL in offline RL, our proposed method also contributes to the offline-to-online learning framework. Using the Q-function pre-trained by CPQL in offline settings enables the online PQL agent to avoid the performance drop typically observed at the start of fine-tuning and to attain robust performance improvements. Our code is available at https://github.com/oh-lab/CPQL.

Byeongchan Kim, Min-hwan Oh • 2026
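
To make the multi-step operator mentioned in the abstract concrete, the following is a minimal sketch of the standard (non-conservative) Peng's Q($\lambda$) target, computed backward along a single offline trajectory with the recursion G_t = r_t + gamma * [(1 - lambda) * max_a Q(s_{t+1}, a) + lambda * G_{t+1}]. It is not the authors' implementation and omits CPQL's conservative term entirely. The function name `pql_targets`, its arguments, and the default values of `gamma` and `lam` are assumptions made for illustration; `next_max_q` stands in for whatever bootstrap value the current Q estimate supplies (in continuous control this would typically be Q evaluated at the policy's action rather than a literal max).

```python
def pql_targets(rewards, next_max_q, dones, gamma=0.99, lam=0.7):
    """Peng's Q(lambda) targets for one trajectory (illustrative sketch).

    rewards[t]    : reward r_t
    next_max_q[t] : bootstrap value for s_{t+1}, e.g. max_a Q(s_{t+1}, a)
    dones[t]      : True if s_{t+1} is terminal
    """
    T = len(rewards)
    targets = [0.0] * T
    next_return = None  # G_{t+1}; unknown past the end of the stored trajectory
    for t in reversed(range(T)):
        if dones[t]:
            # No value beyond a terminal state.
            bootstrap = 0.0
        elif next_return is None:
            # Trajectory was truncated: fall back to a one-step bootstrap.
            bootstrap = next_max_q[t]
        else:
            # Mix the one-step bootstrap with the multi-step return.
            bootstrap = (1.0 - lam) * next_max_q[t] + lam * next_return
        targets[t] = rewards[t] + gamma * bootstrap
        next_return = targets[t]
    return targets
```

With `lam=0` the targets reduce to ordinary one-step Bellman targets, and with `lam=1` on a trajectory that ends in a terminal state they reduce to discounted Monte Carlo returns of the behavior policy, which is one way to see why larger `lam` pulls the fixed point toward the behavior policy's value function.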

Related benchmarks

Task | Dataset | Metric | Result | Rank
Locomotion | D4RL HalfCheetah Medium-Replay | Normalized Score | 0.603 | 68
Locomotion | D4RL MuJoCo Tasks | Average D4RL Locomotion Score (v2) | 1.25e+3 | 29
Locomotion | walker2d medium-replay v2 | Average Normalized Score | 97.4 | 19
Locomotion | halfcheetah medium v2 | Average Normalized Score | 66.6 | 19
Locomotion | halfcheetah medium-expert v2 | Average Normalized Score | 95.3 | 19
Locomotion | walker2d medium v2 | Average Normalized Score | 90 | 19
Locomotion | Walker2d Medium-Expert v2 | Average Normalized Score | 112.9 | 19
Locomotion | MuJoCo walker2d medium-replay D4RL | Average Normalized Score | 128.6 | 16
Locomotion | D4RL hopper v2 (medium) | Normalized Return | 103 | 16
Locomotion | MuJoCo hopper-random | Normalized Score | 102 | 14
Showing 10 of 39 rows
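
For readers unfamiliar with the "Normalized Score" metric in the rows above, D4RL conventionally rescales a raw episode return so that a random policy maps to 0 and an expert policy maps to 100. The sketch below shows that arithmetic only; the function name `d4rl_normalized_score` and the reference returns in the usage line are hypothetical placeholders, not values taken from this page or from the benchmark.

```python
def d4rl_normalized_score(raw_return, random_return, expert_return):
    """Rescale a raw return so that random -> 0 and expert -> 100."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns, chosen only to illustrate the formula.
score = d4rl_normalized_score(raw_return=3000.0, random_return=0.0, expert_return=4000.0)
```

Scores above 100 therefore mean the evaluated policy collected more return than the expert reference for that environment.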
