Multi-task Reasoning on Average (2WikiMultiHop, MMLU, GSM8k) (in-distribution)

75.2 Accuracy

CONCUR

[Chart: Accuracy, Oct 31, 2025 to Dec 10, 2025]
Updated 1mo ago

Evaluation Results

Method    Date      Accuracy   (unlabeled)   Links
          2025.12   75.2       36.5
          2025.12   74.3       41.63
          2025.12   74.3       40.52
          2025.12   73.4       34.21
          2025.12   53.5       10.2
          2025.10   41.29      -
          2025.10   39.79      -
          2025.10   36.94      -
          2025.10   36.09      -
          2025.10   35.99      -
          2025.10   35.8       -
          2025.10   35.74      -
          2025.10   35.62      -
          2025.10   35.55      -
          2025.10   35.3       -
          2025.10   34.88      -
          2025.10   33.82      -
          2025.10   33.7       -
          2025.10   33.24      -
          2025.10   32.49      -
          2025.10   32.43      -
          2025.10   31.46      -
          2025.10   31         -
          2025.10   11.38      -
          2025.10   9.35       -
          2025.10   9.27       -
          2025.10   8.35       -
          2025.10   8.02       -
          2025.10   5.54       -