Rethinking Efficiency in Neural Combinatorial Optimization: Batched Preference Optimization with Mamba
About
We study efficiency as a first-class objective in Neural Combinatorial Optimization (NCO) and present ECO, an efficient learning framework that combines batched preference optimization with a Mamba backbone. Instead of tightly interleaving every policy update with on-policy rollouts, ECO decouples trajectory generation from gradient updates through two stages: supervised warm-up on pre-computed solutions and iterative Direct Preference Optimization (DPO) on batched candidate sets generated by the current policy. We pair this learning pipeline with a mixed Mamba encoder-decoder that reduces memory growth on long sequences and improves hardware utilization. A local-search-guided bootstrapping strategy is further used during training to widen preference margins and stabilize iterative improvement. Importantly, local search is only used to construct stronger preference pairs during training and is never invoked at inference time. On TSP and CVRP, ECO achieves the strongest overall performance among the compared neural baselines while also delivering clear advantages in memory usage and throughput. We provide additional analysis on memory scaling, throughput, and the contribution of each design component.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Capacitated Vehicle Routing Problem | CVRP N=100 | Objective Value15.69 | 87 | |
| Traveling Salesman Problem | TSP-200 | Optimality Gap0.37 | 41 | |
| Traveling Salesman Problem | TSP-500 | Solution Length16.98 | 38 | |
| Capacitated Vehicle Routing Problem | CVRP-200 | Objective Value20.13 | 35 | |
| Capacitated Vehicle Routing Problem | CVRP 1000 | Objective Value65.08 | 29 | |
| Capacitated Vehicle Routing Problem | CVRP500 | Objective Value38.21 | 25 | |
| Traveling Salesperson Problem | TSP1000 | Objective Value24.24 | 8 | |
| Traveling Salesperson Problem | TSP5000 | Objective Value53.76 | 8 |