Language Model Evaluation on Pooled tasks Table 5 Llama-3.1 3.3 (various)

Pooled Accuracy Estimate (γ̂): 57.15

Llama-3.3 70B Instruct

Updated 5mo ago

Evaluation Results

Method	Links	Pooled Accuracy Estimate (γ̂)
Llama-3.3 70B Instruct 2026.02		57.15	-	-	-	-	-	-
Llama-3.3 70B Instruct 2026.02		56.96	0.2	0.13	0.0633	0.113	0.0606	3.21
Llama-3.3 70B Instruct 2026.02		56.87	0.29	0.14	0.0251	0.0462	0.0757	4.18
Mistral-Small-3.1-24B-Instruct-2503 2026.02		56.74	-	-	-	-	-	-
Mistral-Small-3.1-24B-Instruct-2503 2026.02		56.65	0.09	0.16	0.307	0.569	0.624	6.83
Llama-3.3 70B Instruct 2026.02		56.55	0.61	0.14	0	0	0.0002	4.18
Mistral-Small-3.1-24B-Instruct-2503 2026.02		55.87	0.87	0.21	0	0.008	0	10.64
Llama-3.1 8B Instruct 2026.02		42.58	-0.14	0.1	0.921	0.993	0.981	2.73
Llama-3.1 8B Instruct 2026.02		42.51	0.079	0.1	0.787	0.7144	0.866	2.76
Llama-3.1 8B Instruct 2026.02		42.48	-0.04	0.19	0.601	0.282	0.549	8.71
Llama-3.1 8B Instruct 2026.02		42.45	-0.02	0.07	0.629	0.804	0.868	1.31
Llama-3.1 8B Instruct 2026.02		42.45	-0.01	0.1	0.564	0.758	0.888	2.4
Llama-3.1 8B Instruct 2026.02		42.43	-	-	-	-	-	-
Llama-3.1 8B Instruct 2026.02		42.41	0.02	0.17	0.463	0.204	0.274	7.46
Llama-3.1 8B Instruct 2026.02		42.39	0.05	0.17	0.401	0.429	0.519	7.58
Llama-3.1 8B Instruct 2026.02		42.32	0.11	0.13	0.211	0.0136	0.0133	4.45
Llama-3.1 8B Instruct 2026.02		41.65	0.79	0.19	0	0.0009	0.0004	9.03
Llama-3.1 8B Instruct 2026.02		40.7	1.73	0.22	0	0	0	12.63
Llama-3.1 8B (non-instruct) 2026.02		32.72	-	-	-	-	-	-
Llama-3.1 8B (non-instruct) 2026.02		30.13	2.59	0.29	0	0	0	20.99
Llama-3.3 70B Instruct 2026.02		17.7	39.46	0.43	0	0	0	53.07