Long-context language modeling evaluation on HELMET held-out eval

57.61 Accuracy (8K Context)

Qwen 2.5 32B

Dec 15, 2025
Updated 4d ago

Evaluation Results

Date     Accuracy (8K Context)  (unlabeled)  (unlabeled)  (unlabeled)
2025.12  57.61                  56.06        54.01        41.73
2025.12  52.11                  49.36        48.6         43.15
2025.12  51.62                  49.9         47.71        -
2025.12  50.57                  49.68        46.01        -
2025.12  49.41                  49.71        47.46        43.34
2025.12  49.37                  49.92        50.31        48.6
2025.12  49.26                  46.25        42.99        30.47
2025.12  46.09                  43.71        41.26        35.12
2025.12  45.66                  43.62        41.15        36.8
2025.12  45                     43.48        42.44        40.18
2025.12  44.72                  44.6         41.07        35.67
2025.12  43.19                  41.63        39.31        35.74
2025.12  41.78                  42.9         41.82        41.48