Language Model Evaluation on BenchPress short-context (test)

68.84 Accuracy

Qwen3-8B

[Chart: Accuracy by date; Oct 23, 2025]
Updated 22 days ago

Evaluation Results

Date      Accuracy    Method    Links
2025.10   68.84
2025.10   67.69
2025.10   66.14
2025.10   65.91
2025.10   65.06
2025.10   64.21
2025.10   64.05
2025.10   63.58
2025.10   62.94
2025.10   62.51
2025.10   62.5
2025.10   61.28
2025.10   60.99
2025.10   60.3
2025.10   60.03
2025.10   59.79
2025.10   59.67
2025.10   59.35
2025.10   58.58
2025.10   57.54
2025.10   57.28
2025.10   57.12
2025.10   56.94
2025.10   56.77
2025.10   56.62
2025.10   55.68
2025.10   55.5
2025.10   55.26
2025.10   55.17
2025.10   54.7
2025.10   54.36
2025.10   54.31
2025.10   54.17
2025.10   54.05
2025.10   53.62
2025.10   53.5
2025.10   52.94
2025.10   52.82
2025.10   52.77
2025.10   52.55
2025.10   52.23
2025.10   52.22
2025.10   51.9
2025.10   51.16
2025.10   50.89
2025.10   50.1
2025.10   49.99
2025.10   49.52
2025.10   49.31
2025.10   49.18
2025.10   48.92
2025.10   48.85
2025.10   48.84
2025.10   48.04
2025.10   47.07
2025.10   46.9
2025.10   46.63
2025.10   46.36
2025.10   46.13
2025.10   46.06
2025.10   45.59
2025.10   45.47
2025.10   45.41
2025.10   45.13
2025.10   45.07
2025.10   43.67
2025.10   43.62
2025.10   43.01
2025.10   42.76
2025.10   42.66
2025.10   41.91
2025.10   41.79
2025.10   41.31
2025.10   40.45
2025.10   40.44
2025.10   39.95
2025.10   39.5
2025.10   39.36
2025.10   39.16
2025.10   39.09
2025.10   38.84
2025.10   38.72
2025.10   38.68
2025.10   38.51
2025.10   38.3
2025.10   37.8
2025.10   37.05
2025.10   36.61
2025.10   36.14
2025.10   36.14
2025.10   36.11
2025.10   35.49
2025.10   34.93
2025.10   34.83
2025.10   34.75
2025.10   34.33
2025.10   34.3
2025.10   33.94
2025.10   33.63
2025.10   33.48
Showing 100 of 131 rows