POMO: Policy Optimization with Multiple Optima for Reinforcement Learning

About

In neural combinatorial optimization (CO), reinforcement learning (RL) can turn a deep neural net into a fast, powerful heuristic solver of NP-hard problems. This approach has a great potential in practical applications because it allows near-optimal solutions to be found without expert guides armed with substantial domain knowledge. We introduce Policy Optimization with Multiple Optima (POMO), an end-to-end approach for building such a heuristic solver. POMO is applicable to a wide range of CO problems. It is designed to exploit the symmetries in the representation of a CO solution. POMO uses a modified REINFORCE algorithm that forces diverse rollouts towards all optimal solutions. Empirically, the low-variance baseline of POMO makes RL training fast and stable, and it is more resistant to local minima compared to previous approaches. We also introduce a new augmentation-based inference method, which accompanies POMO nicely. We demonstrate the effectiveness of POMO by solving three popular NP-hard problems, namely, traveling salesman (TSP), capacitated vehicle routing (CVRP), and 0-1 knapsack (KP). For all three, our solver based on POMO shows a significant improvement in performance over all recent learned heuristics. In particular, we achieve the optimality gap of 0.14% with TSP100 while reducing inference time by more than an order of magnitude.

Yeong-Dae Kwon, Jinho Choo, Byoungjip Kim, Iljoo Yoon, Youngjune Gwon, Seungjai Min• 2020

Related benchmarks

Task	Dataset	Result
Traveling Salesman Problem	TSPLIB (test)	Tour Length2.71e+3	115
Capacitated Vehicle Routing Problem	CVRPLib Set X	Average Optimality Gap3.191	114
Traveling Salesman Problem	TSP-500 (test)	Gap16.25	95
Capacitated Vehicle Routing Problem	CVRP N=100	Objective Value15.741	87
Traveling Salesman Problem	TSP50	Optimality Gap0.03	77
Traveling Salesman Problem	TSP-100	Optimality Drop0.17	69
Traveling Salesman Problem	TSP-1000 (test)	Optimality Gap114.4	57
Capacitated Vehicle Routing Problem	CVRP procedurally transformed instances Mu-transformed (test)	Objective Value13.612	56
Traveling Salesman Problem (TSP)	TSP n=100 10K instances (test)	Objective Value7.771	52
Traveling Salesman Problem	TSP procedurally transformed instances Mu 0-9 (test)	Objective Value6.77	50

Showing 10 of 332 rows

...

Other info

Follow for update

@wizwand_team Discord