Pairwise Preference Ranking on UltraFeedback 10% holdout (test)
Top result: BENCHALIGN, 86.3 Pairwise Accuracy (RM1-Honest)
[Chart: Pairwise Accuracy (RM1-Honest) by date; plotted entry dated Feb 2, 2026. Selectable metrics: Pairwise Accuracy (RM1-Honest), Pairwise Accuracy (RM2-Honest), Spearman Rho (RM1-Honest), Spearman Rho (RM2-Honest)]
Updated 4d ago
Evaluation Results
| Method | Model set type | Date | Pairwise Accuracy (RM1-Honest) | Pairwise Accuracy (RM2-Honest) | Spearman Rho (RM1-Honest) | Spearman Rho (RM2-Honest) |
|---|---|---|---|---|---|---|
| BENCHALIGN | Arbitrary | 2026.02 | 86.3 | 86.2 | 0.894 | 0.891 |
| RANDOM | Arbitrary | 2026.02 | 70.4 | 69.6 | 0.588 | 0.571 |
| METABENCH | Arbitrary | 2026.02 | 70.2 | 69.8 | 0.58 | 0.566 |
| TINYBENCHMARKS | Arbitrary | 2026.02 | 70.1 | 69.8 | 0.581 | 0.567 |
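The page reports pairwise accuracy and Spearman rho without defining them. As a rough illustration only, the sketch below uses the common textbook definitions: pairwise accuracy as the percentage of item pairs that two score lists order the same way, and Spearman rho as the Pearson correlation of the ranks. The function names, the tie handling, and the toy scores are assumptions made for this example; the benchmark's actual evaluation protocol may differ.

```python
from itertools import combinations
from math import sqrt


def pairwise_accuracy(predicted, reference):
    """Percentage of pairs ordered the same way by both score lists.

    Pairs tied in the reference are skipped (an assumption of this sketch).
    """
    agree = total = 0
    for i, j in combinations(range(len(reference)), 2):
        ref_diff = reference[i] - reference[j]
        if ref_diff == 0:
            continue
        pred_diff = predicted[i] - predicted[j]
        total += 1
        if ref_diff * pred_diff > 0:
            agree += 1
    return 100.0 * agree / total if total else float("nan")


def average_ranks(values):
    """1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(predicted, reference):
    """Pearson correlation computed on the ranks of the two score lists."""
    rx, ry = average_ranks(predicted), average_ranks(reference)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical scores for four items, purely illustrative (not benchmark data).
reference = [0.9, 0.7, 0.4, 0.1]
predicted = [0.8, 0.5, 0.6, 0.2]
acc = pairwise_accuracy(predicted, reference)
rho = spearman_rho(predicted, reference)
print(f"pairwise accuracy: {acc:.1f}%  spearman rho: {rho:.3f}")
```

In the toy input, the predicted scores swap only the second and third items, so five of the six pairs agree with the reference ordering. Under these definitions the two metrics answer related but distinct questions: pairwise accuracy counts every correctly ordered pair equally, while Spearman rho penalizes items more the further they land from their reference rank.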