Our new X account is live! Follow @wizwand_team for updates
WorkDL logo mark

Winner Takes It All: Training Performant RL Populations for Combinatorial Optimization

About

Applying reinforcement learning (RL) to combinatorial optimization problems is attractive as it removes the need for expert knowledge or pre-solved instances. However, it is unrealistic to expect an agent to solve these (often NP-)hard problems in a single shot at inference due to their inherent complexity. Thus, leading approaches often implement additional search strategies, from stochastic sampling and beam search to explicit fine-tuning. In this paper, we argue for the benefits of learning a population of complementary policies, which can be simultaneously rolled out at inference. To this end, we introduce Poppy, a simple training procedure for populations. Instead of relying on a predefined or hand-crafted notion of diversity, Poppy induces an unsupervised specialization targeted solely at maximizing the performance of the population. We show that Poppy produces a set of complementary policies, and obtains state-of-the-art RL results on four popular NP-hard problems: traveling salesman, capacitated vehicle routing, 0-1 knapsack, and job-shop scheduling.

Nathan Grinsztajn, Daniel Furelos-Blanco, Shikha Surana, Cl\'ement Bonnet, Thomas D. Barrett• 2022

Related benchmarks

TaskDatasetResultRank
Capacitated Vehicle Routing ProblemCVRP procedurally transformed instances Mu-transformed (test)
Objective Value13.566
56
Traveling Salesman Problem (TSP)TSP n=100 10K instances (test)
Objective Value7.77
52
Traveling Salesman ProblemTSP procedurally transformed instances Mu 0-9 (test)
Objective Value6.744
50
Traveling Salesman ProblemTSP procedurally transformed instances Mu-transformed (test)
Objective Value6.744
26
Traveling Salesman Problem (TSP)TSP n=150 Generalization 1K instances
Objective Value9.359
25
Capacitated Vehicle Routing ProblemCVRP n=100 (10k instances)
Optimality Gap0.39
21
Traveling Salesman ProblemTSP 1000
Objective Value39.7
20
Job-Shop Scheduling ProblemJSSP 100 instances 10x10 (test)
Objective Value849.7
19
Traveling Salesperson ProblemTSP N=200 (Generalization (128 instances))
Optimality Gap1.007
19
Capacitated Vehicle Routing Problem (CVRP)CVRP n=150 1K instances (Generalization)
Objective Value19.421
18
Showing 10 of 28 rows

Other info

Follow for update