Multi-task Evaluation on Average (GSM8K-CoT, MATH, MBPP, HumanEval)
[Chart: Accuracy by date. Top entry: TAD-Q, 51.6 (May 10, 2026). Available metrics: Accuracy, TPF, AUP.]
Updated 22d ago
Evaluation Results
| Method | Details | Date | Accuracy | TPF | AUP |
|---|---|---|---|---|---|
| TAD-Q | Backbone=LLaDA, Config... | 2026.05 | 51.6 | 5.08 | 225.2 |
| TAD-S | Backbone=LLaDA, Config... | 2026.05 | 49.9 | 5.76 | 257.1 |
| LLaDA | Backbone=LLaDA-Instruct | 2026.05 | 46.2 | 1 | 46.2 |
| d3LLM | Backbone=LLaDA | 2026.05 | 45.9 | 6.25 | 206.1 |
| Fast-dLLM | Backbone=LLaDA | 2026.05 | 45.5 | 2.36 | 84.7 |
| dParallel | Backbone=LLaDA | 2026.05 | 45.5 | 3.9 | 128.6 |
| D2F | Backbone=LLaDA | 2026.05 | 44.1 | 2.47 | 87.9 |
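The three metrics order the methods differently, so it can help to rank the rows per column and to read accuracy relative to the LLaDA baseline. The following is a minimal sketch that only re-sorts the numbers listed above. The helper names `rank_by` and `accuracy_delta` are hypothetical, and the page does not define how TPF or AUP are computed, so the code treats them as opaque scores where higher is assumed to be better.

```python
# Rows transcribed from the results above: (method, accuracy, TPF, AUP).
ROWS = [
    ("TAD-Q", 51.6, 5.08, 225.2),
    ("TAD-S", 49.9, 5.76, 257.1),
    ("LLaDA", 46.2, 1.0, 46.2),
    ("d3LLM", 45.9, 6.25, 206.1),
    ("Fast-dLLM", 45.5, 2.36, 84.7),
    ("dParallel", 45.5, 3.9, 128.6),
    ("D2F", 44.1, 2.47, 87.9),
]

ACCURACY, TPF, AUP = 1, 2, 3  # column indices into each row


def rank_by(rows, column):
    """Return method names sorted by the given column, highest first.

    Assumes higher is better for every metric (an assumption, since the
    page does not define TPF or AUP).
    """
    ordered = sorted(rows, key=lambda row: row[column], reverse=True)
    return [row[0] for row in ordered]


def accuracy_delta(rows, baseline="LLaDA"):
    """Return each method's accuracy minus the baseline's, rounded to 0.1."""
    base = next(row[ACCURACY] for row in rows if row[0] == baseline)
    return {row[0]: round(row[ACCURACY] - base, 1) for row in rows}


if __name__ == "__main__":
    for name, column in (("Accuracy", ACCURACY), ("TPF", TPF), ("AUP", AUP)):
        print(name, rank_by(ROWS, column))
    print(accuracy_delta(ROWS))
```

On these rows, TAD-Q leads on Accuracy, d3LLM on TPF, and TAD-S on AUP, which is why a single-column sort does not summarize the table.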