
Language Model Evaluation on Pooled tasks Table 5 Llama-3.1 3.3 (various)

Pooled Accuracy Estimate (γ̂): 57.15

Llama-3.3 70B Instruct

Feb 9, 2026
Updated 4d ago

Evaluation Results

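| Method | Date | γ̂ | | | | | | | Links |
|---|---|---|---|---|---|---|---|---|---|
| | 2026.02 | 57.15 | - | - | - | - | - | - | |
| | 2026.02 | 56.96 | 0.2 | 0.13 | 0.0633 | 0.113 | 0.0606 | 3.21 | |
| | 2026.02 | 56.87 | 0.29 | 0.14 | 0.0251 | 0.0462 | 0.0757 | 4.18 | |
| | 2026.02 | 56.74 | - | - | - | - | - | - | |
| | 2026.02 | 56.65 | 0.09 | 0.16 | 0.307 | 0.569 | 0.624 | 6.83 | |
| | 2026.02 | 56.55 | 0.61 | 0.14 | 0 | 0 | 0.0002 | 4.18 | |
| | 2026.02 | 55.87 | 0.87 | 0.21 | 0 | 0.008 | 0 | 10.64 | |
| | 2026.02 | 42.58 | -0.14 | 0.1 | 0.921 | 0.993 | 0.981 | 2.73 | |
| | 2026.02 | 42.51 | 0.079 | 0.1 | 0.787 | 0.7144 | 0.866 | 2.76 | |
| | 2026.02 | 42.48 | -0.04 | 0.19 | 0.601 | 0.282 | 0.549 | 8.71 | |
| | 2026.02 | 42.45 | -0.02 | 0.07 | 0.629 | 0.804 | 0.868 | 1.31 | |
| | 2026.02 | 42.45 | -0.01 | 0.1 | 0.564 | 0.758 | 0.888 | 2.4 | |
| | 2026.02 | 42.43 | - | - | - | - | - | - | |
| | 2026.02 | 42.41 | 0.02 | 0.17 | 0.463 | 0.204 | 0.274 | 7.46 | |
| | 2026.02 | 42.39 | 0.05 | 0.17 | 0.401 | 0.429 | 0.519 | 7.58 | |
| | 2026.02 | 42.32 | 0.11 | 0.13 | 0.211 | 0.0136 | 0.0133 | 4.45 | |
| | 2026.02 | 41.65 | 0.79 | 0.19 | 0 | 0.0009 | 0.0004 | 9.03 | |
| | 2026.02 | 40.7 | 1.73 | 0.22 | 0 | 0 | 0 | 12.63 | |
| | 2026.02 | 32.72 | - | - | - | - | - | - | |
| | 2026.02 | 30.13 | 2.59 | 0.29 | 0 | 0 | 0 | 20.99 | |
| | 2026.02 | 17.73 | 9.46 | 0.43 | 0 | 0 | 0 | 53.07 | |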