
Long-context Language Modeling on RULER (test) (4k-256k Sweep)

96.6 Accuracy (4k Context)

Baseline

Updated 3mo ago

Evaluation Results

Method	Links	4k	8k	16k	32k	64k	128k	256k
Baseline 2026.04		96.6	94.1	92.1	88.7	74.3	74.8	41.7
In-Place TTT 2026.04		96.1	95.6	92.7	89.3	78.7	77	43.9
Qwen3-4B (Instruct) 2026.04		95.1	93.6	91	87.8	77.8	66	-
Mistral-7B 2026.04		93.6	91.2	87.2	75.4	49	13.8	-
Phi3-medium-14B 2026.04		93.3	93.2	91.1	86.8	78.6	46.1	-
Llama3-8B 2026.04		92.8	90.3	85.7	79.9	76.3	69.5	-
GLM3-6B 2026.04		87.8	83.4	78.6	69.9	56	42	-