LLM-as-a-Judge on JudgeBench

84.19 Accuracy

DeepSeek-V3

Updated 4mo ago

Evaluation Results

Method	Links	Accuracy
DeepSeek-V3 2026.01		84.19	-	-
Qwen3-30B-A3B-Thinking-2507 2026.01		83.87	-	-
Qwen3-Next-80B-A3B-Thinking 2026.01		82.42	-	-
DeepSeek-R1 2026.01		80.48	-	-
QwQ-32B 2026.01		79.75	-	-
Qwen3-Next-80B-A3B-Instruct 2026.01		79.45	-	-
Qwen3-30B-A3B-Instruct-2507 2026.01		74	-	-
PIF 2026.03		62.2	68.4	36.9
GRPO 2026.03		61.4	74.2	45.1
Qwen2.5-32B-Instruct 2026.01		60.4	-	-
PA-GRPO 2026.03		60.1	70	45.3
PA-GRPO 2026.03		59.4	75.2	43.4
PA-GRPO 2026.03		57.1	58.3	32.4
PriDe 2026.03		56.8	63.5	33.1
UniBias 2026.03		56.2	64	32.5
Base 2026.03		55.4	62.1	29.7
PIF 2026.03		54.3	59.6	37.4
PIF 2026.03		53.3	59.2	30.4
CalibraEval 2026.03		52.9	61	28.9
UniBias 2026.03		52.2	26.1	14.9
PriDe 2026.03		51.2	48.8	29.8
GRPO 2026.03		50.4	62.6	34.8
UniBias 2026.03		50.2	23	10.9
CalibraEval 2026.03		49.7	56.4	31.3
CalibraEval 2026.03		49.3	15.7	7.1
PriDe 2026.03		49.1	16.2	7.2
GRPO 2026.03		48.2	56.1	28.2
Base 2026.03		43.9	45.5	16.5
Base 2026.03		35	34.8	6.1