Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

About

Large Language Models (LLMs) have achieved remarkable success on reasoning benchmarks through Reinforcement Learning with Verifiable Rewards (RLVR), excelling at tasks such as math, coding, logic, and puzzles. However, existing benchmarks evaluate only correctness, while overlooking optimality, namely the ability to find the best solutions under constraints. We propose OPT-BENCH, the first comprehensive framework for training and evaluating LLMs on NP-hard optimization problems through quality-aware RLVR. OPT-BENCH provides three key components: a scalable training infrastructure with instance generators, quality verifiers, and optimal baselines across 10 tasks; a rigorous benchmark with 1,000 instances evaluating both feasibility, measured by Success Rate, and quality, measured by Quality Ratio; and quality-aware rewards that enable continuous improvement beyond binary correctness. Training on Qwen2.5-7B-Instruct-1M with 15K examples achieves 93.1% SR and 46.6% QR, significantly outperforming GPT-4o, which achieves 29.6% SR and 14.6% QR. Beyond optimization, training on OPT-BENCH transfers to diverse tasks, including mathematics (+2.2%), logic (+1.2%), knowledge (+4.1%), and instruction following (+6.1%). Our analysis reveals that quality-aware rewards improve solutions by 28.8% over binary rewards, and that task diversity drives generalization more than data quantity, offering insights into RLVR scaling for complex reasoning.
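To make the difference between binary and quality-aware rewards concrete, here is a minimal Python sketch. The abstract does not give the exact formulas, so everything in it is an assumption made for illustration: the function names are hypothetical, the quality ratio is taken to be the reference optimum divided by the achieved objective (for minimization, capped at 1), and infeasible answers are scored 0 when averaging. The paper's actual definitions of Success Rate and Quality Ratio may differ.

```python
from typing import List, Tuple


def binary_reward(feasible: bool) -> float:
    """Binary RLVR-style reward: 1 for any feasible answer, 0 otherwise."""
    return 1.0 if feasible else 0.0


def quality_ratio(value: float, optimal: float, minimize: bool = True) -> float:
    """Assumed quality measure in (0, 1]; 1.0 means the reference optimum is matched.

    Both objective values must be positive for the ratio to be meaningful.
    """
    if value <= 0 or optimal <= 0:
        raise ValueError("objective values must be positive")
    ratio = optimal / value if minimize else value / optimal
    # Cap at 1.0 in case the reference baseline is not truly optimal.
    return min(ratio, 1.0)


def quality_aware_reward(
    feasible: bool, value: float, optimal: float, minimize: bool = True
) -> float:
    """Assumed quality-aware reward: 0 if infeasible, else the quality ratio."""
    if not feasible:
        return 0.0
    return quality_ratio(value, optimal, minimize)


def summarize(results: List[Tuple[bool, float, float]]) -> Tuple[float, float]:
    """Return (success rate, mean quality ratio) over (feasible, value, optimal) triples.

    Infeasible answers contribute 0 to the mean quality ratio (an assumption).
    """
    if not results:
        raise ValueError("results must be non-empty")
    n = len(results)
    sr = sum(1 for feasible, _, _ in results if feasible) / n
    qr = sum(quality_aware_reward(f, v, o) for f, v, o in results) / n
    return sr, qr


if __name__ == "__main__":
    # Two feasible answers (one optimal, one 25% worse) and one infeasible answer.
    runs = [(True, 100.0, 100.0), (True, 125.0, 100.0), (False, 0.0, 100.0)]
    print(summarize(runs))
```

Under these assumptions, a binary reward scores the two feasible answers in the example identically, while the quality-aware reward separates the optimal answer from the one that is 25% worse, which is the kind of signal the abstract credits for improvement beyond binary correctness.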

Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Yang Li, Linyang Li, Haodong Duan, Qingwen Liu, Kai Chen • 2026

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Math	MATH 500	Accuracy	77.4	126
Graph Optimization	FORGE-BENCH	Success Rate (SR)	100	63
Knowledge	GPQA Diamond	Accuracy (GPQA Knowledge)	38.9	49
Scheduling	FORGE-BENCH	Success Rate (SR)	96.3	42
Planning	FORGE-BENCH	Success Rate (SR)	99	21
Instruction	IFEval	Score	80.5	17
Logic	KorBench	Accuracy	46.6	11
Math	OlpBench	Accuracy	32.1	11
