
Instruction following on AlpacaEval High-Variance (Top 20%) 2.0

Reward Score: 11.6

rDPO (Best-of-K)

[Chart: Reward Score, y-axis ticks 1.876 to 9.4495; x-axis label Mar 9, 2026]
Updated 1mo ago

Evaluation Results

Method  Date     Reward Score  (unlabeled)  (unlabeled)  (unlabeled)  Links
-       2026.03  11.6          9.05         -6.5         347          -
-       2026.03  11.56         8.65         -5.74        329          -
-       2026.03  8.88          9.4          -9.92        328          -
-       2026.03  8.81          9.21         -9.61        328          -
-       2026.03  7.89          10.15        -12.41       371          -
-       2026.03  7.8           7.72         -7.64        172          -
-       2026.03  7.7           7.28         -6.86        164          -
-       2026.03  7.48          10.22        -12.96       381          -
-       2026.03  7.28          8.12         -8.96        381          -
-       2026.03  7.26          8.1          -8.94        381          -
-       2026.03  7.24          8.1          -8.96        381          -
-       2026.03  7.22          8.4          -9.58        376          -
-       2026.03  7.21          8.55         -9.89        381          -
-       2026.03  7.19          8.15         -9.11        380          -
-       2026.03  7.17          8.34         -9.51        381          -
-       2026.03  7.17          7.3          -7.43        196          -
-       2026.03  7.12          7.29         -7.46        196          -
-       2026.03  3.45          6.89         -10.33       256          -
-       2026.03  2.79          5.46         -8.13        263          -
-       2026.03  2.59          3.76         -4.93        265          -
-       2026.03  2.53          4.09         -5.65        265          -
-       2026.03  2.35          4            -5.65        262          -
-       2026.03  2.34          4.42         -6.5         263          -
-       2026.03  2.31          3.84         -5.37        249          -
-       2026.03  2.3           4.07         -5.84        264          -
-       2026.03  2.25          4.15         -6.05        258          -