Multi-task Evaluation on Aggregate (GSM8K, BFCL, Spider, HumanEval)

Average Accuracy: 79.4

RLSTA

[Chart omitted; y-axis ticks: 41.856, 51.603, 61.35, 71.097; date marker: May 26, 2026]
Updated 7d ago

Evaluation Results

Date      Average Accuracy   (unlabeled)
2026.05   79.4               -
2026.05   79.3               -
2026.05   79                 -
2026.05   78.6               -
2026.05   77.5               -
2026.05   67.5               -
2026.05   66.1               84.1
2026.05   66                 -
2026.05   65.6               -
2026.05   63.8               -
2026.05   63.3               -
2026.05   58.2               86.3
2026.05   57.6               72.5
2026.05   55.2               69.9
2026.05   53                 68.4
2026.05   52.8               66.5
2026.05   47.9               73.1
2026.05   46.1               72.3
2026.05   43.5               66
2026.05   43.3               68.4