Frictional Q-Learning
About
Off-policy reinforcement learning suffers from extrapolation errors when a learned policy selects actions that are weakly supported in the replay buffer. In this study, we address this issue by drawing an analogy to static friction. From this perspective, the replay buffer is represented as a smooth, low-dimensional action manifold, where the support directions correspond to the tangential component, while the normal component captures the dominant first-order extrapolation error. This decomposition reveals an intrinsic anisotropy in value sensitivity that naturally induces a stability condition analogous to a friction threshold. To mitigate deviations toward unsupported actions, we propose Frictional Q-Learning, an off-policy algorithm that encodes supported actions as tangent directions using a contrastive variational autoencoder. We further show that an orthonormal basis of the orthogonal complement corresponds to normal components under mild local isometry assumptions. Extensive empirical results on standard continuous-control benchmarks consistently demonstrate robust and stable performance compared with competitive baselines.
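The tangent/normal decomposition described above can be illustrated with a small numerical sketch. This is not the paper's method: it assumes a local linear (PCA/SVD) estimate of the tangent space in place of the contrastive variational autoencoder, and the function name `tangent_normal_split`, the neighborhood `neighbors`, and the intrinsic dimension `k` are hypothetical choices made for illustration. The paper's friction-style threshold is not reproduced here.

```python
import numpy as np


def tangent_normal_split(neighbors, delta, k):
    """Split an action deviation into tangential and normal parts.

    Assumption (not from the paper): the local tangent space of the
    supported-action manifold is estimated by the top-k principal
    directions of nearby replay-buffer actions.
    """
    centered = neighbors - neighbors.mean(axis=0)
    # Rows of vt form an orthonormal basis of the full action space,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=True)
    tangent_basis = vt[:k]   # directions supported by the data
    normal_basis = vt[k:]    # orthonormal basis of the orthogonal complement
    tangential = tangent_basis.T @ (tangent_basis @ delta)
    normal = normal_basis.T @ (normal_basis @ delta)
    return tangential, normal


# Toy data: supported actions lie in the plane z = 0 of a 3-D action space.
neighbors = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [2.0, 1.0, 0.0],
])
delta = np.array([1.0, 2.0, 3.0])  # candidate deviation from a supported action

tangential, normal = tangent_normal_split(neighbors, delta, k=2)
print("tangential:", tangential)
print("normal:    ", normal)
```

In this toy setting the in-plane part of `delta` is recovered as the tangential component and the out-of-plane part as the normal component, which is the direction the abstract associates with the dominant first-order extrapolation error.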
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | Normalized Score | 85.25 | 169 |
| Offline Reinforcement Learning | D4RL Hopper-medium-expert v2 | Normalized Return | 90.29 | 61 |
| Offline Reinforcement Learning | D4RL Medium-Replay Walker2d | Normalized Score | 54.87 | 52 |
| Continuous Control | MuJoCo Ant v4 | Average Return | 6.18e+3 | 46 |
| Continuous Control | MuJoCo Walker2d v4 | Normalized Performance | 56.5986 | 39 |
| Continuous Control | MuJoCo HalfCheetah v4 | Average Return | 1.60e+4 | 36 |
| Offline Reinforcement Learning | D4RL hopper medium-replay | Reward | 73.33 | 32 |
| Offline Reinforcement Learning | D4RL Halfcheetah medium | Reward | 50.66 | 30 |
| Offline Reinforcement Learning | D4RL HalfCheetah Med-Replay | Normalized Avg Return | 46.23 | 22 |
| Offline Reinforcement Learning | D4RL Walker2d medium | Normalized Avg Return | 44.6 | 20 |