
Large Language Model Evaluation on 10 tasks average

70.56 Avg Accuracy

DeltaLoss-only

[Chart: Avg Accuracy; y-axis ticks 57.82, 61.1275, 64.435, 67.7425; date label Dec 4, 2025]
Updated 4d ago

Evaluation Results

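| Method | Date | Avg Accuracy | (unlabeled) | Links |
|---|---|---|---|---|
| | 2025.12 | 70.56 | 100.8 | |
| | 2025.12 | 70.5 | 100.71 | |
| | 2025.12 | 70.31 | 100.45 | |
| | 2025.12 | 70.1 | 100.15 | |
| | 2025.12 | 70 | - | |
| | 2025.12 | 69.96 | 99.95 | |
| | 2025.12 | 69.78 | 99.7 | |
| | 2025.12 | 69.32 | 99.04 | |
| | 2025.12 | 69.01 | 98.6 | |
| | 2025.12 | 68.71 | 98.16 | |
| | 2025.12 | 67.21 | 100.32 | |
| | 2025.12 | 67.17 | 100.25 | |
| | 2025.12 | 67 | - | |
| | 2025.12 | 66.92 | 99.88 | |
| | 2025.12 | 66.89 | 99.83 | |
| | 2025.12 | 66.86 | 99.79 | |
| | 2025.12 | 66.61 | 99.42 | |
| | 2025.12 | 66.05 | 98.58 | |
| | 2025.12 | 66 | 98.51 | |
| | 2025.12 | 65.67 | - | |
| | 2025.12 | 65.45 | 99.67 | |
| | 2025.12 | 65.3 | 99.43 | |
| | 2025.12 | 65.07 | 97.12 | |
| | 2025.12 | 65.04 | 99.03 | |
| | 2025.12 | 64.51 | 98.24 | |
| | 2025.12 | 64.16 | - | |
| | 2025.12 | 64.12 | 99.93 | |
| | 2025.12 | 64.06 | 97.55 | |
| | 2025.12 | 64.04 | 97.52 | |
| | 2025.12 | 63.64 | 99.18 | |
| | 2025.12 | 63.37 | 96.5 | |
| | 2025.12 | 63.24 | - | |
| | 2025.12 | 63.19 | 98.49 | |
| | 2025.12 | 63.13 | 96.13 | |
| | 2025.12 | 62.57 | 97.51 | |
| | 2025.12 | 62.54 | 98.89 | |
| | 2025.12 | 62.33 | 97.14 | |
| | 2025.12 | 62.28 | 98.49 | |
| | 2025.12 | 62.19 | 98.34 | |
| | 2025.12 | 62 | 98.04 | |
| | 2025.12 | 61.89 | 97.87 | |
| | 2025.12 | 61.45 | 97.18 | |
| | 2025.12 | 61.34 | 95.59 | |
| | 2025.12 | 60.72 | 94.64 | |
| | 2025.12 | 60.64 | 94.5 | |
| | 2025.12 | 60.62 | 92.32 | |
| | 2025.12 | 60.25 | 95.28 | |
| | 2025.12 | 59.81 | 94.58 | |
| | 2025.12 | 58.54 | 92.57 | |
| | 2025.12 | 58.31 | 90.88 | |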