
MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation

About

The LLM-as-a-Judge paradigm shows promise for evaluating generative content but lacks reliability in reasoning-intensive scenarios, such as programming. Inspired by recent advances in reasoning models and shifts in scaling laws, we pioneer bringing test-time computation into LLM-as-a-Judge, proposing MCTS-Judge, a resource-efficient, System-2 thinking framework for code correctness evaluation. MCTS-Judge leverages Monte Carlo Tree Search (MCTS) to decompose problems into simpler, multi-perspective evaluations. Through a node-selection strategy that combines self-assessment based on historical actions in the current trajectory and the Upper Confidence Bound for Trees based on prior rollouts, MCTS-Judge balances global optimization and refinement of the current trajectory. We further designed a high-precision, unit-test-level reward mechanism to encourage the Large Language Model (LLM) to perform line-by-line analysis. Extensive experiments on three benchmarks and five LLMs demonstrate the effectiveness of MCTS-Judge, which improves the base model's accuracy from 41% to 80%, surpassing the o1-series models with 3x fewer tokens. Further evaluations validate the superiority of its reasoning trajectory in logic, analytics, thoroughness, and overall quality, while revealing the test-time scaling law of the LLM-as-a-Judge paradigm.
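The node-selection strategy described above combines two signals: the classic Upper Confidence Bound for Trees (UCT) computed from prior rollouts, and a self-assessment score derived from the current trajectory. A minimal sketch of how such a blended selection rule might look is below; the mixing weight `alpha` and the `self_assessment` callback are hypothetical illustrations, not the paper's exact formulation.

```python
import math

def uct_score(value, visits, parent_visits, c=1.414):
    """Standard UCT: exploitation term plus an exploration bonus
    based on visit counts from prior rollouts."""
    if visits == 0:
        return float("inf")  # always try unvisited children first
    return value / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, self_assessment, alpha=0.5):
    """Blend global UCT statistics with a per-trajectory self-assessment.
    `alpha` (hypothetical) trades off global optimization against
    refinement of the current trajectory."""
    parent_visits = sum(ch["visits"] for ch in children) or 1
    def combined(ch):
        return (alpha * uct_score(ch["value"], ch["visits"], parent_visits)
                + (1 - alpha) * self_assessment(ch))
    return max(children, key=combined)

# Toy usage: two candidate evaluation sub-steps with rollout statistics.
children = [{"value": 2.0, "visits": 4}, {"value": 1.0, "visits": 1}]
best = select_child(children, lambda ch: ch["value"] / max(ch["visits"], 1))
```

With these toy numbers, the rarely visited second child wins because both its exploration bonus and its self-assessment score are higher, illustrating how the blend favors promising but under-explored branches.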

Yutong Wang, Pengliang Ji, Chaoqun Yang, Kaixin Li, Ming Hu, Jiaoyang Li, Guillaume Sartoretti · 2025

Related benchmarks

Task                          Dataset   Result          Rank
Code Correctness Evaluation   APPS      F1 Score 42.8   25
Code Correctness Evaluation   BCB       F1 Score 42.5   25
Code Correctness Evaluation   HE-PY     F1 Score 46.4   25
Code Correctness Evaluation   HE-CPP    F1 Score 30.0   25
Code Correctness Evaluation   HE-Go     F1 Score 38.4   25
Code Correctness Evaluation   HE-JS     F1 Score 35.2   25
Code Correctness Evaluation   HE-JA     F1 Score 34.5   25
