Multi-judge evaluation on Shared 500-prompt sample
[Chart: Global Correlation (r) by model. Highlighted value: 0.87 (GPT-5.2), Mar 12, 2026. Selectable metrics: Global Correlation (r), Within-Judge Correlation (r), Agreement Gap (%), Recovery Rate (%).]
Updated 2mo ago
Evaluation Results
| Method | Family | Date | Global Correlation (r) | Within-Judge Correlation (r) | Agreement Gap (%) | Recovery Rate (%) |
|---|---|---|---|---|---|---|
| GPT-5.2 | OpenAI | 2026.03 | 0.87 | 0.7 | 20 | 69.4 |
| Claude Sonnet 4 | Anthropic | 2026.03 | 0.59 | 0.42 | 29 | 47.7 |
| GPT-4.1-mini | OpenAI | 2026.03 | 0.56 | 0.47 | 17 | 43.6 |
| Gemini-2.5-flash | Google | 2026.03 | 0.47 | 0.27 | 42 | 23.8 |
| Llama-3.3-70b | Meta | 2026.03 | 0.31 | 0.23 | 25 | 18.6 |