
SPAR: Support-Preserving Action Rectification

About

Offline policy improvement faces an inherent conflict between maximizing value and fitting the data distribution. In-sample weighted regression is stable but over-conservative, suppressing high-value actions in the tail of the distribution; gradient-based approaches, conversely, often suffer from a conflict between the fitting gradient and the optimization gradient, which drives the policy off the data manifold. To address this, we propose Support-Preserving Action Rectification (SPAR), which reframes global policy learning as local residual rectification anchored to a frozen, pure behavior cloning policy. The framework performs fine-grained fitting and local policy improvement in the residual space, thereby contracting the search space. We further introduce Latent Self-Imitation, a latent-sampling weighted-regression mechanism that addresses the fitting-improvement gradient conflict in the residual space. Theoretically, we prove that this mechanism eliminates the manifold-normal drift of standard value gradients, and extensive D4RL experiments show that SPAR extracts significant gains from suboptimal baselines and achieves state-of-the-art performance.
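The abstract does not give implementation details, so the following is only a minimal sketch of the two ideas it names: a bounded residual added to a frozen anchor action, and a value-weighted regression target built from sampled candidates. Every name here (`bc_policy`, `decode_residual`, `toy_value`, `max_residual`, `temperature`) is a hypothetical stand-in, and the exponential weighting is the generic weighted-regression form, not necessarily the one used in SPAR.

```python
import math
import random

def bc_policy(state):
    """Hypothetical stand-in for the frozen behavior cloning policy (the anchor)."""
    return [math.tanh(0.5 * s) for s in state]

def decode_residual(latent, max_residual=0.1):
    """Hypothetical decoder: maps a latent sample to a bounded residual."""
    return [max_residual * math.tanh(z) for z in latent]

def rectify(base_action, residual, max_residual=0.1):
    """Add a clamped residual to the anchor action, then clip to [-1, 1]."""
    out = []
    for a, r in zip(base_action, residual):
        r = max(-max_residual, min(max_residual, r))
        out.append(max(-1.0, min(1.0, a + r)))
    return out

def weighted_regression_target(candidates, values, temperature=1.0):
    """Average the candidates with weights proportional to exp(value / temperature)."""
    top = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - top) / temperature) for v in values]
    total = sum(weights)
    dim = len(candidates[0])
    return [
        sum(w * c[i] for w, c in zip(weights, candidates)) / total
        for i in range(dim)
    ]

def toy_value(action):
    """Hypothetical critic: prefers actions close to 0.3 in every dimension."""
    return -sum((a - 0.3) ** 2 for a in action)

if __name__ == "__main__":
    rng = random.Random(0)
    state = [0.2, -0.4]
    anchor = bc_policy(state)

    # Sample latents, decode them to residuals, and score the rectified actions.
    latents = [[rng.gauss(0.0, 1.0) for _ in state] for _ in range(16)]
    residuals = [decode_residual(z) for z in latents]
    values = [toy_value(rectify(anchor, r)) for r in residuals]

    # The regression target is itself a residual, so it stays within the bound.
    target = weighted_regression_target(residuals, values, temperature=0.05)
    print("anchor action:   ", anchor)
    print("target residual: ", target)
    print("rectified action:", rectify(anchor, target))
```

Because the target is a convex combination of bounded residuals, the rectified action never moves more than `max_residual` away from the anchor in any dimension, which is one simple way to picture the contracted search space the abstract describes.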

Jiaxin Zhao, Weihang Pan, Xun Liang, Binbin Lin • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | Normalized Score 97 | 169 |
| Offline Reinforcement Learning | D4RL Medium-Replay Hopper | Normalized Score 101.9 | 109 |
| Offline Reinforcement Learning | D4RL antmaze-umaze (diverse) | Normalized Score 76.7 | 74 |
| Offline Reinforcement Learning | D4RL Adroit pen (cloned) | Normalized Return 76.2 | 53 |
| Offline Reinforcement Learning | D4RL Adroit pen (human) | Normalized Return 62.7 | 53 |
| Offline Reinforcement Learning | D4RL MuJoCo halfcheetah-medium-expert | Normalized Score 97 | 43 |
| Offline Reinforcement Learning | D4RL MuJoCo walker2d-medium-expert | Normalized Score 113.4 | 36 |
| Offline Reinforcement Learning | D4RL MuJoCo halfcheetah-medium-replay | Normalized Score 0.509 | 36 |
| Offline Reinforcement Learning | D4RL MuJoCo hopper-medium-expert | Normalized Score 108.7 | 36 |
| Offline Reinforcement Learning | D4RL MuJoCo hopper-medium-replay | Normalized Score 101.9 | 23 |
Showing 10 of 15 rows
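
For readers unfamiliar with the metric, D4RL normalized scores conventionally rescale a raw episode return so that a random policy maps to 0 and an expert policy maps to 100. The sketch below shows that rescaling; the function name is hypothetical and the reference returns in the usage line are made-up illustrative numbers, not values from the benchmark or from this page.

```python
def d4rl_normalized_score(raw_return, random_return, expert_return):
    """Rescale a raw return so that random maps to 0 and expert maps to 100."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)

if __name__ == "__main__":
    # Illustrative reference returns only; real values are defined per environment.
    print(d4rl_normalized_score(raw_return=50.0, random_return=0.0, expert_return=100.0))
```

Scores above 100, such as several in the table, indicate a return higher than the expert reference for that environment.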
