Combinatorial Optimization with Policy Adaptation using Latent Space Search
About
Combinatorial Optimization underpins many real-world applications and yet, designing performant algorithms to solve these complex, typically NP-hard, problems remains a significant research challenge. Reinforcement Learning (RL) provides a versatile framework for designing heuristics across a broad spectrum of problem domains. However, despite notable progress, RL has not yet supplanted industrial solvers as the go-to solution. Current approaches emphasize pre-training heuristics that construct solutions but often rely on search procedures with limited variance, such as stochastically sampling numerous solutions from a single policy or employing computationally expensive fine-tuning of the policy on individual problem instances. Building on the intuition that performant search at inference time should be anticipated during pre-training, we propose COMPASS, a novel RL approach that parameterizes a distribution of diverse and specialized policies conditioned on a continuous latent space. We evaluate COMPASS across three canonical problems - Travelling Salesman, Capacitated Vehicle Routing, and Job-Shop Scheduling - and demonstrate that our search strategy (i) outperforms state-of-the-art approaches on 11 standard benchmarking tasks and (ii) generalizes better, surpassing all other approaches on a set of 18 procedurally transformed instance distributions.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Math Reasoning | AMC | Accuracy6.8 | 70 | |
| Capacitated Vehicle Routing Problem | CVRP procedurally transformed instances Mu-transformed (test) | Objective Value13.534 | 56 | |
| Traveling Salesman Problem | TSP procedurally transformed instances Mu 0-9 (test) | Objective Value6.738 | 50 | |
| Math Reasoning | MATH 500 | Accuracy17 | 38 | |
| Traveling Salesman Problem | TSP-500 | Solution Length16.81 | 32 | |
| Traveling Salesman Problem | TSP procedurally transformed instances Mu-transformed (test) | Objective Value6.738 | 26 | |
| Traveling Salesman Problem (TSP) | TSP n=150 Generalization 1K instances | Objective Value9.354 | 25 | |
| Traveling Salesperson Problem | TSP N=200 (Generalization (128 instances)) | Optimality Gap0.348 | 19 | |
| Capacitated Vehicle Routing Problem (CVRP) | CVRP n=150 1K instances (Generalization) | Objective Value19.313 | 18 | |
| Traveling Salesman Problem | TSP n=125 (1k instances) | Objective Value8.586 | 15 |