
Continuously Discovering Novel Strategies via Reward-Switching Policy Optimization

About

We present Reward-Switching Policy Optimization (RSPO), a paradigm to discover diverse strategies in complex RL environments by iteratively finding novel policies that are both locally optimal and sufficiently different from existing ones. To encourage the learning policy to consistently converge towards a previously undiscovered local optimum, RSPO switches between extrinsic and intrinsic rewards via a trajectory-based novelty measurement during the optimization process. When a sampled trajectory is sufficiently distinct, RSPO performs standard policy optimization with extrinsic rewards. For trajectories with high likelihood under existing policies, RSPO utilizes an intrinsic diversity reward to promote exploration. Experiments show that RSPO is able to discover a wide spectrum of strategies in a variety of domains, ranging from single-agent particle-world tasks and MuJoCo continuous control to multi-agent stag-hunt games and StarCraft II challenges.
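The reward-switching rule described above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the function name, the threshold form (a log-likelihood cutoff `delta`), and the choice of intrinsic reward (negative per-step log-likelihood under the most similar existing policy) are all assumptions made for illustration.

```python
import numpy as np

def switched_rewards(traj_states, traj_actions, extrinsic_rewards,
                     ref_log_prob_fns, delta=-5.0, intrinsic_scale=0.1):
    """Choose extrinsic or intrinsic rewards for one sampled trajectory.

    If the trajectory's log-likelihood under every previously discovered
    (reference) policy falls below the novelty threshold `delta`, the
    trajectory counts as sufficiently distinct and keeps its extrinsic
    rewards. Otherwise it is trained with an intrinsic diversity reward;
    here that reward is the negative per-step log-likelihood under the
    closest reference policy (an illustrative choice, not the paper's
    exact formulation).
    """
    # Per-step log-probs of the sampled actions under each reference policy.
    step_logps = np.array([
        [fn(s, a) for s, a in zip(traj_states, traj_actions)]
        for fn in ref_log_prob_fns
    ])                                            # shape: (num_refs, T)
    traj_logps = step_logps.sum(axis=1)           # trajectory log-likelihoods
    if traj_logps.max() < delta:                  # novel w.r.t. every ref policy
        return np.asarray(extrinsic_rewards), "extrinsic"
    closest = traj_logps.argmax()                 # most similar existing policy
    return -intrinsic_scale * step_logps[closest], "intrinsic"
```

A trajectory that all reference policies assign low probability keeps its task reward, while one that mimics an existing policy is instead pushed away from it, which is the switching behavior the abstract describes.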

Zihan Zhou, Wei Fu, Bingliang Zhang, Yi Wu • 2022

Related benchmarks

Task                                  Dataset          Metric               Result    Rank
Robot Locomotion                      Humanoid         Cumulative Reward    1.46e+3   16
Multi-Agent Reinforcement Learning    SMAC 2m1z        State Entropy        0.032     12
Strategy Discovery                    GRF 3v1          Distinct Strategies  2.3       11
Multi-Agent Reinforcement Learning    GRF 3v1 hard     Win Rate (%)         94        7
State Entropy Estimation              GRF 3v1          State Entropy        0.011     7
Multi-Agent Reinforcement Learning    SMAC 2c64zg      Win Rate (%)         85        7
Multi-Agent Reinforcement Learning    GRF (CA)         Win Rate (%)         76        6
Multi-Agent Reinforcement Learning    SMAC 2c_vs_64zg  State Entropy        0.07      6
Strategy Discovery                    GRF (CA)         Distinct Strategies  2         6
Strategy Discovery                    GRF Corner       Distinct Strategies  1.6       6

Showing 10 of 13 rows.
