LLM-as-a-Judge on RewardBench

92.9 Accuracy

Qwen3-Next-80B-A3B-Thinking

Updated 3mo ago

Evaluation Results

Method	Links	Accuracy
Qwen3-Next-80B-A3B-Thinking 2026.01		92.9
Qwen3-30B-A3B-Thinking-2507 2026.01		92.01
DeepSeek-R1 2026.01		91.18
QwQ-32B 2026.01		91.05
Qwen3-30B-A3B-Instruct-2507 2026.01		89.88
DeepSeek-V3 2026.01		89.74
Qwen2.5-32B-Instruct 2026.01		89.31
Qwen3-Next-80B-A3B-Instruct 2026.01		88.96
Genii 2025.10		82.48
Vanilla 2025.10		80.8
Judgment SFT 2025.10		73.77
Genii 2025.10		73.6
Genii 2025.10		72.63
Judgment SFT 2025.10		71.48
Judgment SFT 2025.10		71.06
Self-Consistency 2025.10		69.57
Vanilla 2025.10		69.41
Vanilla 2025.10		69.41
Genii 2025.10		67.17
Reprompt 2025.10		66.6
Self-Consistency 2025.10		66.36
Long Reasoning 2025.10		66.3
Vanilla 2025.10		64.66
Long Reasoning 2025.10		63.95
Reprompt 2025.10		60.4
Vanilla 2025.10		57.89
Genii 2025.10		56.65
Genii 2025.10		51.15
Self-Consistency 2025.10		50.5
Vanilla 2025.10		50.28
Reprompt 2025.10		40.2