Distance Weighted Supervised Learning for Offline Interaction Data
About
Sequential decision-making algorithms often struggle to leverage different sources of unstructured offline interaction data. Imitation learning (IL) methods based on supervised learning are robust, but require optimal demonstrations, which are hard to collect. Offline goal-conditioned reinforcement learning (RL) algorithms promise to learn from sub-optimal data, but face optimization challenges, especially with high-dimensional data. To bridge the gap between IL and RL, we introduce Distance Weighted Supervised Learning (DWSL), a supervised method for learning goal-conditioned policies from offline data. DWSL models the entire distribution of time-steps between states in offline data with only supervised learning, and uses this distribution to approximate shortest path distances. To extract a policy, we weight actions by their reduction in distance estimates. Theoretically, DWSL converges to an optimal policy constrained to the data distribution, an attractive property for offline learning, without any bootstrapping. Across all datasets we test, DWSL empirically maintains behavior cloning as a lower bound while still exhibiting policy improvement. In high-dimensional image domains, DWSL surpasses the performance of both prior goal-conditioned IL and RL algorithms. Visualizations and code can be found at https://sites.google.com/view/dwsl/home.
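The two steps described above (turning a learned distribution over time-steps into a distance estimate, then weighting actions by how much they reduce that estimate) can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration rather than the authors' implementation: the function names, the log-sum-exp soft minimum with temperature `alpha`, the exponential weighting with scale `beta`, the weight clipping, and the convention that bin `i` of the distribution stands for `i + 1` time-steps.

```python
import numpy as np


def soft_min_distance(probs, alpha=1.0):
    """Soft-minimum distance from a categorical distribution over time-steps.

    probs[..., i] is the predicted probability that the goal is reached in
    i + 1 time-steps (an assumed binning). Returns
    -alpha * log(sum_k probs[k] * exp(-k / alpha)), which approaches the
    smallest supported k as alpha shrinks.
    """
    probs = np.asarray(probs, dtype=float)
    k = np.arange(1, probs.shape[-1] + 1)
    # log(0) = -inf is intended for empty bins, so silence that warning.
    with np.errstate(divide="ignore"):
        logits = np.log(probs) - k / alpha
    # Numerically stable log-sum-exp over the last axis.
    m = logits.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    return -alpha * lse


def action_weights(d_state, d_next, beta=1.0, max_weight=100.0):
    """Weights for supervised policy learning (hypothetical form).

    Actions that lead to a state with a lower distance estimate than the
    current state receive a weight above 1; clipping keeps weights bounded.
    """
    reduction = np.asarray(d_state, dtype=float) - np.asarray(d_next, dtype=float)
    return np.minimum(np.exp(beta * reduction), max_weight)


# Usage: two transitions toward the same goal, with made-up distributions.
p_state = np.array([[0.0, 0.0, 1.0, 0.0], [0.25, 0.25, 0.25, 0.25]])
p_next = np.array([[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]])
weights = action_weights(soft_min_distance(p_state), soft_min_distance(p_next))
```

In a training loop, weights of this kind would multiply the per-sample behavior cloning loss, so that a uniform weight of 1 recovers plain behavior cloning.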
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | Kitchen Partial | Normalized Score | 74 | 62 |
| Offline Reinforcement Learning | D4RL antmaze-umaze (diverse) | Normalized Score | 74.6 | 40 |
| Offline Reinforcement Learning | antmaze medium-play | Score | 77.6 | 35 |
| Offline Reinforcement Learning | kitchen mixed | Normalized Score | 74.6 | 29 |
| Offline Reinforcement Learning | Antmaze umaze | Average Return | 71.2 | 24 |
| Offline Reinforcement Learning | antmaze medium-diverse | Score | 74.8 | 18 |
| Offline Reinforcement Learning | antmaze large-play | Score | 15.2 | 18 |
| Offline Reinforcement Learning | AntMaze-Ultra-Play | Avg Normalized Score | 25.2 | 10 |
| Offline Reinforcement Learning | AntMaze Ultra-Diverse | Avg Normalized Score | 2.50e+3 | 10 |