Pairwise Comparison on AUTO-J Eval-P
[Chart: Agreement by date. Top result: GPT-4, 62.28 (Nov 30, 2023). Metrics shown: Agreement, Consistency.]
Evaluation Results
Method                Setting         Date     Agreement  Consistency
GPT-4                 Reference-Free  2023.11  62.28      86.28
CRITIQUELLM           Reference-Free  2023.11  50.93      82.76
AUTO-J-Bilingual-6B   Reference-Free  2023.11  49.43      77.23
ChatGPT               Reference-Free  2023.11  42.74      62.43
Mixtral-8x7B          Reference-Free  2023.11  35.2       52.66
JudgeLM-13B           Reference-Free  2023.11  35.13      58.19
Llama-2-70B-Chat      Reference-Free  2023.11  33.62      56.9
Qwen-14B-Chat         Reference-Free  2023.11  31.68      52.08
Baichuan2-13B-Chat    Reference-Free  2023.11  19.4       32.33
ChatGLM3-6B           Reference-Free  2023.11  14.15      26.22