Winner Takes It All: Training Performant RL Populations for Combinatorial Optimization
About
Applying reinforcement learning (RL) to combinatorial optimization problems is attractive as it removes the need for expert knowledge or pre-solved instances. However, it is unrealistic to expect an agent to solve these (often NP-)hard problems in a single shot at inference due to their inherent complexity. Thus, leading approaches often implement additional search strategies, from stochastic sampling and beam search to explicit fine-tuning. In this paper, we argue for the benefits of learning a population of complementary policies, which can be simultaneously rolled out at inference. To this end, we introduce Poppy, a simple training procedure for populations. Instead of relying on a predefined or hand-crafted notion of diversity, Poppy induces an unsupervised specialization targeted solely at maximizing the performance of the population. We show that Poppy produces a set of complementary policies, and obtains state-of-the-art RL results on four popular NP-hard problems: traveling salesman, capacitated vehicle routing, 0-1 knapsack, and job-shop scheduling.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Capacitated Vehicle Routing Problem | CVRP procedurally transformed instances Mu-transformed (test) | Objective Value13.566 | 56 | |
| Traveling Salesman Problem (TSP) | TSP n=100 10K instances (test) | Objective Value7.77 | 52 | |
| Traveling Salesman Problem | TSP procedurally transformed instances Mu 0-9 (test) | Objective Value6.744 | 50 | |
| Traveling Salesman Problem | TSP procedurally transformed instances Mu-transformed (test) | Objective Value6.744 | 26 | |
| Traveling Salesman Problem (TSP) | TSP n=150 Generalization 1K instances | Objective Value9.359 | 25 | |
| Capacitated Vehicle Routing Problem | CVRP n=100 (10k instances) | Optimality Gap0.39 | 21 | |
| Traveling Salesman Problem | TSP 1000 | Objective Value39.7 | 20 | |
| Job-Shop Scheduling Problem | JSSP 100 instances 10x10 (test) | Objective Value849.7 | 19 | |
| Traveling Salesperson Problem | TSP N=200 (Generalization (128 instances)) | Optimality Gap1.007 | 19 | |
| Capacitated Vehicle Routing Problem (CVRP) | CVRP n=150 1K instances (Generalization) | Objective Value19.421 | 18 |