Zero-shot Language Evaluation on Gauntlet 20 benchmarks (test)

Average Normalized Accuracy: 9.2

Prior-based

[Chart: Average Normalized Accuracy over time; date shown: Sep 23, 2025]
Updated 1mo ago

Evaluation Results

Method  Links
2025.09   9.2    9.53   11.27   10.31   11.13   3.79
2025.09   8.22   9.98   11.91    7.34    7.91   3.96
2025.09   7.56   7.03    6.84    7.31   12.67   3.97
2025.09   7.09   6.71    6.11    6.89   11.93   3.82
2025.09   6.65   5.03    9.13    4.22   11.21   3.66
2025.09   5.78   5.52    0.44    6.14   13.22   3.59
2025.09   5.6    5.68    4.93    1.97   11.6    3.8
2025.09   5.39   5.12    4.29    1.74   12.31   3.49
2025.09   5.26   5.47    6.53    2.9     7.84   3.58
2025.09   4.96   4.96    1.81    1.47   12.83   3.7