
Language Modeling on Broad evaluation suite unseen S1 (dev)

74.2 Average Accuracy

all-FA

Updated 3mo ago

Evaluation Results

Method	Links	Average Accuracy
all-FA 2026.04		74.2	100	1
Idealized|All–18 2026.04		71.8	97	2
Reg|Lklhd–26 2026.04		71.1	96	2.9
Reg|Lklhd–18 2026.04		69.7	94	4.8
Idealized|Lklhd–6 2026.04		66.8	90	6.2
Idealized|All–6 2026.04		65.3	88	6.1
Reg|Lklhd–13 2026.04		60.2	81	6.9
Reg|Lklhd–10 2026.04		57.2	77	10.7