Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

About

Large Language Models (LLMs) have achieved remarkable success on reasoning benchmarks through Reinforcement Learning with Verifiable Rewards (RLVR), excelling at tasks such as math, coding, logic, and puzzles. However, existing benchmarks evaluate only correctness, while overlooking optimality, namely the ability to find the best solutions under constraints. We propose OPT-BENCH, the first comprehensive framework for training and evaluating LLMs on NP-hard optimization problems through quality-aware RLVR. OPT-BENCH provides three key components: a scalable training infrastructure with instance generators, quality verifiers, and optimal baselines across 10 tasks; a rigorous benchmark with 1,000 instances evaluating both feasibility, measured by Success Rate, and quality, measured by Quality Ratio; and quality-aware rewards that enable continuous improvement beyond binary correctness. Training on Qwen2.5-7B-Instruct-1M with 15K examples achieves 93.1% SR and 46.6% QR, significantly outperforming GPT-4o, which achieves 29.6% SR and 14.6% QR. Beyond optimization, training on OPT-BENCH transfers to diverse tasks, including mathematics (+2.2%), logic (+1.2%), knowledge (+4.1%), and instruction following (+6.1%). Our analysis reveals that quality-aware rewards improve solutions by 28.8% over binary rewards, and that task diversity drives generalization more than data quantity, offering insights into RLVR scaling for complex reasoning.
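The abstract names two metrics (Success Rate and Quality Ratio) and contrasts quality-aware rewards with binary ones, but does not give formulas. The sketch below is one plausible reading, not the authors' definition. It assumes minimization tasks with positive objectives, scores per-instance quality as optimal/achieved, counts infeasible answers as zero quality, and uses a hypothetical `feasibility_bonus` weight. The names `Result`, `quality`, and `quality_aware_reward` are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Result:
    feasible: bool              # did the answer satisfy every constraint?
    objective: Optional[float]  # achieved objective (minimization); None if infeasible
    optimal: float              # objective of the reference optimal baseline


def quality(r: Result) -> float:
    """Assumed per-instance quality in [0, 1]: optimal / achieved, 0 if infeasible."""
    if not r.feasible or r.objective is None:
        return 0.0
    if r.objective <= 0 or r.optimal <= 0:
        raise ValueError("this sketch assumes strictly positive objectives")
    return min(1.0, r.optimal / r.objective)


def success_rate(results: Sequence[Result]) -> float:
    """Fraction of instances with a feasible answer."""
    return sum(r.feasible for r in results) / len(results)


def quality_ratio(results: Sequence[Result]) -> float:
    """Mean assumed quality over all instances, infeasible ones counted as 0."""
    return sum(quality(r) for r in results) / len(results)


def binary_reward(r: Result) -> float:
    """Correctness-only reward: 1 for any feasible answer, else 0."""
    return 1.0 if r.feasible else 0.0


def quality_aware_reward(r: Result, feasibility_bonus: float = 0.5) -> float:
    """Hypothetical shaped reward: a base bonus for feasibility plus a quality term."""
    if not r.feasible:
        return 0.0
    return feasibility_bonus + (1.0 - feasibility_bonus) * quality(r)


results = [
    Result(feasible=True, objective=10.0, optimal=10.0),   # optimal answer
    Result(feasible=True, objective=20.0, optimal=10.0),   # feasible but 2x worse
    Result(feasible=False, objective=None, optimal=10.0),  # constraint violated
]
print(success_rate(results), quality_ratio(results))
print(binary_reward(results[1]), quality_aware_reward(results[1]))
```

Under these assumptions the binary reward cannot tell the first two answers apart, while the shaped reward still leaves room to improve the second, which is the kind of signal the abstract attributes to quality-aware rewards.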

Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Yang Li, Linyang Li, Haodong Duan, Qingwen Liu, Kai Chen • 2026

Related benchmarks

Task                Dataset        Metric                     Result  Rank
Math                MATH 500       Accuracy                   77.4    120
Graph Optimization  FORGE-BENCH    Success Rate (SR)          100     63
Scheduling          FORGE-BENCH    SR                         96.3    42
Knowledge           GPQA Diamond   Accuracy (GPQA Knowledge)  38.9    37
Planning            FORGE-BENCH    Success Rate (SR)          99      21
Instruction         IFEval         Score                      80.5    17
Logic               KorBench       Accuracy                   46.6    11
Math                OlpBench       Accuracy                   32.1    11