Transitive RL: Value Learning via Divide and Conquer
About
In this work, we present Transitive Reinforcement Learning (TRL), a new value learning algorithm based on a divide-and-conquer paradigm. TRL is designed for offline goal-conditioned reinforcement learning (GCRL) problems, where the aim is to find a policy that can reach any state from any other state in the smallest number of steps. TRL converts a triangle inequality structure present in GCRL into a practical divide-and-conquer value update rule. This has several advantages compared to alternative value learning paradigms. Compared to temporal difference (TD) methods, TRL suffers less from bias accumulation, as in principle it only requires $O(\log T)$ recursions (as opposed to $O(T)$ in TD learning) to handle a length-$T$ trajectory. Unlike Monte Carlo methods, TRL suffers less from high variance as it performs dynamic programming. Experimentally, we show that TRL achieves the best performance in highly challenging, long-horizon benchmark tasks compared to previous offline GCRL algorithms.
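The divide-and-conquer recursion can be illustrated with a minimal tabular sketch (a hypothetical simplification, not the paper's implementation: the function name `dc_distance` and the exact-trajectory setting are assumptions for illustration). For a length-$T$ trajectory segment, the distance estimate between steps $i$ and $j$ is formed by splitting at a midpoint and combining the two halves via the triangle inequality, so only $O(\log T)$ recursion levels are needed, versus the $O(T)$ one-step backups of TD learning:

```python
def dc_distance(i: int, j: int, cache=None) -> int:
    """Estimated number of steps from trajectory index i to j (i <= j).

    Hypothetical sketch of the divide-and-conquer idea: split the
    segment at a midpoint m and combine the halves, exploiting the
    triangle inequality d(i, j) <= d(i, m) + d(m, j).
    """
    if cache is None:
        cache = {}
    if j - i <= 1:          # base case: identical or adjacent states
        return j - i
    if (i, j) not in cache:
        m = (i + j) // 2    # split the segment at its midpoint
        # combine the two sub-segment estimates (triangle inequality)
        cache[(i, j)] = dc_distance(i, m, cache) + dc_distance(m, j, cache)
    return cache[(i, j)]
```

In the actual algorithm, the exact sub-segment distances above would be replaced by a learned goal-conditioned value function and the midpoint would be chosen among candidate waypoints; the sketch only shows why the recursion depth for a length-$T$ segment is logarithmic rather than linear.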
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | puzzle-4x4-play OGBench 5 tasks v0 | Average Success Rate | 34 | 18 |
| Overall | puzzle 4x5 | Success Rate | 9.70e+3 | 10 |
| Overall | puzzle 4x6 | Success Rate | 5.10e+3 | 10 |
| task1 | humanoidmaze giant | Success Rate | 71 | 10 |
| task1 | puzzle 4x6 | Success Rate | 1.00e+4 | 10 |
| task2 | puzzle 4x5 | Success Rate | 9.90e+3 | 10 |
| task2 | puzzle 4x6 | Success Rate | 66 | 10 |
| task3 | puzzle 4x5 | Success Rate | 100 | 10 |
| task3 | puzzle 4x6 | Success Rate | 6.70e+3 | 10 |
| task4 | humanoidmaze giant | Success Rate | 9.40e+3 | 10 |