CodeScaler: Scaling Code LLM Training and Test-Time Inference via Reward Models

About

Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, a reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across four coding benchmarks, CodeScaler consistently outperforms execution-based RL by +1.55 points on Qwen3-8B-Base and +4.23 points on Qwen3-14B-Base. By further scaling to 44K problems with additional synthetic data, CodeScaler yields a +14.64-point improvement over the base model without requiring any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit-test-based approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in the general and reasoning domains (+2.7 points on average).
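The abstract does not spell out how extraction, reward shaping, or test-time selection are implemented, so the following is only a minimal sketch of one plausible reading: pull the last fenced block that parses as Python out of each response, assign a fixed floor reward when nothing valid is found, and pick the best of N candidates by reward-model score. The names `extract_valid_code`, `shaped_reward`, `best_of_n`, the `INVALID_REWARD` value, and the `score_fn` callable (a stand-in for a reward model) are all hypothetical and not taken from the paper.

```python
import ast
import re
from typing import Callable, List, Optional

# Build the fence marker programmatically so this example can itself be fenced.
FENCE_MARK = "`" * 3
FENCE_RE = re.compile(
    FENCE_MARK + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE_MARK, re.DOTALL
)

# Assumed floor reward for responses with no syntactically valid code.
INVALID_REWARD = -1.0


def extract_valid_code(response: str) -> Optional[str]:
    """Return the last fenced block that parses as Python, or None."""
    for block in reversed(FENCE_RE.findall(response)):
        if not block.strip():
            continue  # skip empty blocks
        try:
            ast.parse(block)
        except (SyntaxError, ValueError):
            continue  # not valid Python, try an earlier block
        return block
    return None


def shaped_reward(response: str, score_fn: Callable[[str], float]) -> float:
    """Score extracted code with score_fn; invalid responses get the floor."""
    code = extract_valid_code(response)
    if code is None:
        return INVALID_REWARD
    return score_fn(code)


def best_of_n(responses: List[str], score_fn: Callable[[str], float]) -> str:
    """Test-time selection: return the candidate with the highest reward."""
    return max(responses, key=lambda r: shaped_reward(r, score_fn))
```

In this sketch no candidate is executed at selection time, which is the general reason a reward-model scorer can be cheaper than running unit tests; the 10-fold latency figure above is the paper's own measurement and is not something this toy code reproduces.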

Xiao Zhu, Xinyu Zhou, Boyu Zhu, Hanxu Hu, Mingzhe Du, Haotian Zhang, Huiming Wang, Zhijiang Guo • 2026

Related benchmarks

Task	Dataset	Result
Code Generation	CodeContests	Avg@8 39.33	26
Code Generation	CodeForces	Avg@8 24.89	22
Code Generation	LiveBench	Avg@8 41.69	22
Code Generation	LiveCodeBench	Avg@8 27.78	12
Code Generation	MBPP	Avg@8 82.29	12
Reward Modeling	RM-Bench (full)	Chat Score 83	11
