Continuously Discovering Novel Strategies via Reward-Switching Policy Optimization
About
We present Reward-Switching Policy Optimization (RSPO), a paradigm for discovering diverse strategies in complex RL environments by iteratively finding novel policies that are both locally optimal and sufficiently different from existing ones. To encourage the learning policy to consistently converge to a previously undiscovered local optimum, RSPO switches between extrinsic and intrinsic rewards during optimization based on a trajectory-level novelty measure. When a sampled trajectory is sufficiently distinct from those of existing policies, RSPO performs standard policy optimization with extrinsic rewards. For trajectories with high likelihood under existing policies, RSPO instead applies an intrinsic diversity reward to promote exploration. Experiments show that RSPO discovers a wide spectrum of strategies across a variety of domains, ranging from single-agent particle-world tasks and MuJoCo continuous control to multi-agent stag-hunt games and StarCraft II challenges.
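The per-trajectory switching rule can be summarized with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the paper's implementation: the `Policy` interface with a `log_prob` method, the `novelty_threshold` parameter, and the choice of the maximum reference log-likelihood as the intrinsic diversity reward are all hypothetical simplifications for readability.

```python
from typing import List, Protocol, Sequence, Tuple

import numpy as np


class Policy(Protocol):
    # Hypothetical interface: log-probability of taking `act` at `obs`.
    def log_prob(self, obs: np.ndarray, act: np.ndarray) -> float: ...


Step = Tuple[np.ndarray, np.ndarray]  # one (observation, action) pair


def trajectory_log_likelihood(trajectory: Sequence[Step], policy: Policy) -> float:
    # Average per-step action log-likelihood of the trajectory under `policy`.
    return float(np.mean([policy.log_prob(obs, act) for obs, act in trajectory]))


def rspo_rewards(
    trajectory: Sequence[Step],
    extrinsic_rewards: Sequence[float],
    reference_policies: Sequence[Policy],
    novelty_threshold: float,
) -> List[float]:
    """Choose the reward sequence used to optimize the current policy."""
    if not reference_policies:
        # First iteration: no previously discovered policies to diverge from,
        # so this reduces to plain policy optimization.
        return list(extrinsic_rewards)
    likelihoods = [
        trajectory_log_likelihood(trajectory, pi) for pi in reference_policies
    ]
    if max(likelihoods) < novelty_threshold:
        # Sufficiently distinct from every discovered policy:
        # keep the extrinsic task rewards.
        return list(extrinsic_rewards)
    # Too similar to some existing policy: switch to an intrinsic diversity
    # reward that penalizes (obs, act) pairs the reference policies favor.
    return [
        -max(pi.log_prob(obs, act) for pi in reference_policies)
        for obs, act in trajectory
    ]
```

Iterating this loop (train to convergence, freeze the resulting policy into `reference_policies`, restart) is what produces the growing set of distinct strategies described above.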
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Robot Locomotion | Humanoid | Cumulative Reward | 1.46e+3 | 16 |
| Multi-Agent Reinforcement Learning | SMAC 2m_vs_1z | State Entropy | 0.032 | 12 |
| Strategy Discovery | GRF 3v1 | Distinct Strategies | 2.3 | 11 |
| Multi-Agent Reinforcement Learning | GRF 3v1 hard | Win Rate | 94 | 7 |
| State Entropy Estimation | GRF 3v1 | State Entropy | 0.011 | 7 |
| Multi-Agent Reinforcement Learning | SMAC 2c_vs_64zg | Win Rate | 85 | 7 |
| Multi-Agent Reinforcement Learning | GRF (CA) | Win Rate | 76 | 6 |
| Multi-Agent Reinforcement Learning | SMAC 2c_vs_64zg | State Entropy | 0.07 | 6 |
| Strategy Discovery | GRF (CA) | Distinct Strategies | 2 | 6 |
| Strategy Discovery | GRF Corner | Distinct Strategies | 1.6 | 6 |