Rethinking Efficiency in Neural Combinatorial Optimization: Batched Preference Optimization with Mamba

About

We study efficiency as a first-class objective in Neural Combinatorial Optimization (NCO) and present ECO, an efficient learning framework that combines batched preference optimization with a Mamba backbone. Instead of tightly interleaving every policy update with on-policy rollouts, ECO decouples trajectory generation from gradient updates through two stages: supervised warm-up on pre-computed solutions and iterative Direct Preference Optimization (DPO) on batched candidate sets generated by the current policy. We pair this learning pipeline with a mixed Mamba encoder-decoder that reduces memory growth on long sequences and improves hardware utilization. A local-search-guided bootstrapping strategy is further used during training to widen preference margins and stabilize iterative improvement. Importantly, local search is only used to construct stronger preference pairs during training and is never invoked at inference time. On TSP and CVRP, ECO achieves the strongest overall performance among the compared neural baselines while also delivering clear advantages in memory usage and throughput. We provide additional analysis on memory scaling, throughput, and the contribution of each design component.

Zhenxing Xu, Zeyuan Ma, Weidong Bao, Yan Zheng, Ji Wang, Zhiguang Cao• 2026

Related benchmarks

Task	Dataset	Result
Capacitated Vehicle Routing Problem	CVRP N=100	Objective Value15.69	87
Traveling Salesman Problem	TSP-200	Optimality Gap0.37	41
Traveling Salesman Problem	TSP-500	Solution Length16.98	38
Capacitated Vehicle Routing Problem	CVRP-200	Objective Value20.13	35
Capacitated Vehicle Routing Problem	CVRP 1000	Objective Value65.08	29
Capacitated Vehicle Routing Problem	CVRP500	Objective Value38.21	25
Traveling Salesperson Problem	TSP1000	Objective Value24.24	8
Traveling Salesperson Problem	TSP5000	Objective Value53.76	8

Showing 8 of 8 rows

Other info

Follow for update

@wizwand_team Discord