Iteratively Learn Diverse Strategies with State Distance Information

About

In complex reinforcement learning (RL) problems, policies with similar rewards may have substantially different behaviors. It remains a fundamental challenge to optimize rewards while also discovering as many diverse strategies as possible, which can be crucial in many practical applications. Our study examines two design choices for tackling this challenge, i.e., diversity measure and computation framework. First, we find that with existing diversity measures, visually indistinguishable policies can still yield high diversity scores. To accurately capture the behavioral difference, we propose to incorporate the state-space distance information into the diversity measure. In addition, we examine two common computation frameworks for this problem, i.e., population-based training (PBT) and iterative learning (ITR). We show that although PBT is the precise problem formulation, ITR can achieve comparable diversity scores with higher computation efficiency, leading to improved solution quality in practice. Based on our analysis, we further combine ITR with two tractable realizations of the state-distance-based diversity measures and develop a novel diversity-driven RL algorithm, State-based Intrinsic-reward Policy Optimization (SIPO), with provable convergence properties. We empirically examine SIPO across three domains from robot locomotion to multi-agent games. In all of our testing environments, SIPO consistently produces strategically diverse and human-interpretable policies that cannot be discovered by existing baselines.
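As a rough illustration of the idea described above, the following sketch shows one way a state-distance-based intrinsic reward could be added to the task reward when policies are learned one after another (ITR). It is not the SIPO objective from the paper: the kernel form, the names `state_distance_bonus` and `shaped_reward`, and the parameters `bandwidth` and `weight` are all assumptions chosen for illustration, and the paper's convergence-related machinery is omitted entirely.

```python
import math
from typing import Sequence

State = Sequence[float]


def squared_distance(a: State, b: State) -> float:
    """Squared Euclidean distance between two state vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def state_distance_bonus(
    state: State,
    reference_states: Sequence[State],
    bandwidth: float = 1.0,
) -> float:
    """Hypothetical intrinsic reward in [0, 1).

    Returns one minus the mean Gaussian-kernel similarity between `state`
    and the states visited by previously learned policies. States far from
    everything seen before score close to 1; revisited states score near 0.
    """
    if not reference_states:
        # First iteration: no earlier policy to differ from.
        return 0.0
    similarities = [
        math.exp(-squared_distance(state, ref) / (2.0 * bandwidth ** 2))
        for ref in reference_states
    ]
    return 1.0 - sum(similarities) / len(similarities)


def shaped_reward(
    task_reward: float,
    state: State,
    reference_states: Sequence[State],
    weight: float = 0.5,
    bandwidth: float = 1.0,
) -> float:
    """Task reward plus a weighted state-distance bonus (assumed form)."""
    bonus = state_distance_bonus(state, reference_states, bandwidth)
    return task_reward + weight * bonus


# States visited by earlier policies (made-up values for the example).
earlier_states = [(0.0, 0.0), (0.1, 0.0)]
near = state_distance_bonus((0.0, 0.0), earlier_states)
far = state_distance_bonus((5.0, 5.0), earlier_states)
```

In this sketch, `near` is close to 0 and `far` is close to 1, so a new policy is rewarded for reaching states that earlier policies did not visit, regardless of how different its action distribution looks.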

Wei Fu, Weihua Du, Jingwei Li, Sunli Chen, Jingzhao Zhang, Yi Wu • 2023

Related benchmarks

Task                                  Dataset            Metric                Result     Rank
Robot Locomotion                      Humanoid           Cumulative Reward     3.76e+3    16
Multi-Agent Reinforcement Learning    SMAC 2m1z          State Entropy         0.038      12
Strategy Discovery                    GRF 3v1            Distinct Strategies   5.7        11
State Entropy Estimation              GRF 3v1            State Entropy         0.012      7
Multi-Agent Reinforcement Learning    GRF 3v1 hard       Win Rate              93         7
Multi-Agent Reinforcement Learning    SMAC 2c64zg        Win Rate              99         7
Multi-Agent Reinforcement Learning    SMAC 2c_vs_64zg    State Entropy         0.072      6
Strategy Discovery                    GRF (CA)           Distinct Strategies   3.3        6
Strategy Discovery                    GRF Corner         Distinct Strategies   3          6
Multi-Agent Reinforcement Learning    GRF (CA)           Win Rate              70         6
Showing 10 of 14 rows
