Pair-wise comparison on MTBench Human
[Chart: Accuracy over time; top score 88.9 (CCE@16), Feb 18, 2025. Updated 4d ago.]
Evaluation Results
| Method        | Model                  | Date    | Accuracy |
|---------------|------------------------|---------|----------|
| CCE@16        | Qwen 2.5 72B-Ins...    | 2025.02 | 88.9     |
| CCE@16        | GPT-4o                 | 2025.02 | 83.6     |
| CCE@16        | Llama 3.3 70B-In...    | 2025.02 | 83.5     |
| CCE-random@16 | GPT-4o                 | 2025.02 | 83.1     |
| 16-Criteria   | GPT-4o                 | 2025.02 | 82.8     |
| Agg@16        | GPT-4o                 | 2025.02 | 82.6     |
| Maj@16        | GPT-4o                 | 2025.02 | 82.4     |
| Vanilla       | GPT-4o                 | 2025.02 | 82.1     |
| CCE@16        | Qwen 2.5 32B-Ins...    | 2025.02 | 82.1     |
| LongPrompt    | GPT-4o                 | 2025.02 | 81.8     |
| EvalPlan      | GPT-4o                 | 2025.02 | 81.4     |
| Vanilla       | Llama 3.3 70B-In...    | 2025.02 | 81.1     |
| Vanilla       | Qwen 2.5 72B-Ins...    | 2025.02 | 79.5     |
| Vanilla       | Qwen 2.5 32B-Ins...    | 2025.02 | 79.0     |
| CCE@16        | Qwen 2.5 7B-Inst...    | 2025.02 | 76.7     |
| Vanilla       | Qwen 2.5 7B-Inst...    | 2025.02 | 76.1     |