Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

About

Off-policy, value-based reinforcement learning methods such as Q-learning are appealing because they can learn from arbitrary experience, including data collected by older policies or other agents. In practice, however, bootstrapping makes long-horizon learning brittle: estimation errors at later states propagate backward through temporal-difference (TD) updates and can compound over time. We propose long-horizon Q-learning (LQL), which introduces a principled backstop against compounding error when learning the optimal action-value function. LQL builds on a prior optimality tightening observation: any realized action sequence lower-bounds what the optimal policy can achieve in expectation, so acting optimally earlier should not be worse than following the observed actions for several steps before switching to optimal behavior. Our contribution is to turn this inequality into a practical stabilization mechanism for Q-learning by using a hinge loss to penalize violations of these bounds. Importantly, LQL computes these penalties using network outputs already produced for the TD error, requiring no auxiliary networks and no additional forward passes relative to Q-learning. When combined with multiple state-of-the-art methods on a range of online and offline-to-online benchmarks, LQL consistently outperforms both 1-step TD and n-step TD learning at similar runtime.
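The bound described above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the unsquared hinge, the averaging over (t, n) pairs, and the choice to skip n = 1 (which coincides with the ordinary 1-step TD target) are all hypothetical, and the trajectory chunk is assumed to contain no terminal transition. For a step t and horizon n, the lower bound on Q(s_t, a_t) is the discounted sum of the n observed rewards plus gamma^n times the bootstrap value at the state reached after those n steps. With stochastic transitions the inequality holds only in expectation, so a single sampled bound is a noisy estimate.

```python
def n_step_lower_bounds(rewards, v_boot, gamma, t, max_n):
    """Return sampled n-step lower bounds on Q*(s_t, a_t) for n = 1..max_n.

    rewards[i] is the reward observed after taking a_i in s_i.
    v_boot[i] is a bootstrap value for s_{i+1}, e.g. max_a Q(s_{i+1}, a),
    which a Q-learning update already computes for its TD target.
    Assumption: no terminal transition occurs inside the chunk.
    """
    bounds = []
    ret = 0.0       # discounted sum of rewards r_t .. r_{t+n-1}
    discount = 1.0  # gamma ** (number of rewards accumulated so far)
    for n in range(1, max_n + 1):
        i = t + n - 1
        if i >= len(rewards):
            break  # horizon runs past the end of the chunk
        ret += discount * rewards[i]
        discount *= gamma
        bounds.append(ret + discount * v_boot[i])
    return bounds


def hinge_penalty(q_taken, rewards, v_boot, gamma, max_n):
    """Mean hinge penalty over all (t, n) pairs with n >= 2.

    q_taken[t] is Q(s_t, a_t). A pair contributes only when the
    sampled bound exceeds q_taken[t], i.e. when the bound is violated.
    """
    total, count = 0.0, 0
    for t in range(len(rewards)):
        bounds = n_step_lower_bounds(rewards, v_boot, gamma, t, max_n)
        for bound in bounds[1:]:  # skip n = 1, the ordinary TD target
            total += max(0.0, bound - q_taken[t])
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Toy three-step chunk with made-up numbers, for illustration only.
    rewards = [0.0, 0.0, 1.0]
    v_boot = [0.5, 0.5, 0.0]
    q_taken = [0.2, 0.1, 1.0]
    print(hinge_penalty(q_taken, rewards, v_boot, gamma=0.5, max_n=3))
```

Both inputs, q_taken and v_boot, are quantities a 1-step TD update already needs, which is consistent with the abstract's statement that the penalties require no additional forward passes. In a real agent this term would be added to the TD loss with some weight, and that weight is likewise an assumption here.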

Armaan A. Abraham, Lucy Xiaoyang Shi, Chelsea Finn • 2026

Related benchmarks

Task | Dataset | Result | Rank
Success Rate Evaluation | OGBench cube-double online (train) | Success Rate: 95.1 | 13
Success Rate Evaluation | OGBench cube-triple (online train) | Success Rate: 52.1 | 13
Success Rate Evaluation | OGBench humanoid-md (online train) | Success Rate: 96.2 | 13
Success Rate Evaluation | OGBench antmaze-giant online (train) | Success Rate: 65.4 | 13
Success Rate Evaluation | OGBench online scene (train) | Success Rate: 97.3 | 13
Robot Manipulation | RoboMimic Can online (train) | Success Rate: 90 | 11
Robot Manipulation | RoboMimic Square online (train) | Success Rate: 92.5 | 11
Overall Success Rate | humanoidmaze-giant online (train) | Success Rate: 75.7 | 6
task1 | humanoidmaze-giant online (train) | Success Rate: 31.3 | 6
task2 | humanoidmaze-giant online (train) | Success Rate: 97.3 | 6
Showing 10 of 13 rows
