Pairwise Preference Ranking on UltraFeedback 10% holdout (test)

86.3 Pairwise Accuracy (RM1-Honest)

BENCHALIGN

Updated 4mo ago

Evaluation Results

Method	Links	Pairwise Accuracy (RM1-Honest)
BENCHALIGN 2026.02		86.3	86.2	0.894	0.891
RANDOM 2026.02		70.4	69.6	0.588	0.571
METABENCH 2026.02		70.2	69.8	0.58	0.566
TINYBENCHMARKS 2026.02		70.1	69.8	0.581	0.567