Winner Takes It All: Training Performant RL Populations for Combinatorial Optimization

About

Applying reinforcement learning (RL) to combinatorial optimization problems is attractive as it removes the need for expert knowledge or pre-solved instances. However, it is unrealistic to expect an agent to solve these (often NP-)hard problems in a single shot at inference due to their inherent complexity. Thus, leading approaches often implement additional search strategies, from stochastic sampling and beam search to explicit fine-tuning. In this paper, we argue for the benefits of learning a population of complementary policies, which can be simultaneously rolled out at inference. To this end, we introduce Poppy, a simple training procedure for populations. Instead of relying on a predefined or hand-crafted notion of diversity, Poppy induces an unsupervised specialization targeted solely at maximizing the performance of the population. We show that Poppy produces a set of complementary policies, and obtains state-of-the-art RL results on four popular NP-hard problems: traveling salesman, capacitated vehicle routing, 0-1 knapsack, and job-shop scheduling.

Nathan Grinsztajn, Daniel Furelos-Blanco, Shikha Surana, Cl\'ement Bonnet, Thomas D. Barrett• 2022

Related benchmarks

Task	Dataset	Result
Capacitated Vehicle Routing Problem	CVRP procedurally transformed instances Mu-transformed (test)	Objective Value13.566	56
Traveling Salesman Problem (TSP)	TSP n=100 10K instances (test)	Objective Value7.77	52
Traveling Salesman Problem	TSP procedurally transformed instances Mu 0-9 (test)	Objective Value6.744	50
Traveling Salesperson Problem	TSP N=200 (Generalization (128 instances))	Optimality Gap1.007	35
Traveling Salesman Problem	TSP procedurally transformed instances Mu-transformed (test)	Objective Value6.744	26
Traveling Salesperson Problem	TSP n=100 (train)	Objective Value7.766	26
Traveling Salesman Problem (TSP)	TSP n=150 Generalization 1K instances	Objective Value9.359	25
Traveling Salesman Problem	TSP 1000	Objective Value39.7	23
Capacitated Vehicle Routing Problem	CVRP n=100 (10k instances)	Optimality Gap0.39	21
Job-Shop Scheduling Problem	JSSP 100 instances 10x10 (test)	Objective Value849.7	19

Showing 10 of 28 rows

Other info

Follow for update

@wizwand_team Discord