Generative Language Modeling and Problem Solving on IFEval, AIME25, GSM8K, GPQA, HumanEval, LCB Suite
[Chart: IFEval Score over time. Data point shown: Original, 90.4, Apr 6, 2026. Selectable metrics: IFEval Score, AIME25 Score, GSM8K Accuracy, GPQA Score, HumanEval Pass Rate, LCB Suite Score, General Performance Score.]
Updated 12d ago
Evaluation Results
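Method   | Method details            | Date    | IFEval Score | AIME25 Score | GSM8K Accuracy | GPQA Score | HumanEval Pass Rate | LCB Suite Score | General Performance Score
---------|---------------------------|---------|--------------|--------------|----------------|------------|---------------------|-----------------|--------------------------
Original | Experts (N)=128, Calib... | 2026.04 | 90.4         | 56.7         | 89.3           | 47         | 93.3                | 48.6            | 70.9
REAM     | Experts (N)=96, Calibr... | 2026.04 | 89.9         | 60           | 86.3           | 38.4       | 93.3                | 51              | 69.8
REAP     | Experts (N)=96, Calibr... | 2026.04 | 89.6         | 50           | 87.9           | 39.4       | 94.5                | 50.3            | 68.6
HC-SMoE  | Experts (N)=96, Calibr... | 2026.04 | 88.2         | 60           | 84.7           | 34.3       | 91.5                | 45.9            | 67.4
Freq     | Experts (N)=96, Calibr... | 2026.04 | 87.8         | 60           | 82.9           | 36.9       | 93.9                | 44              | 67.6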
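
The General Performance Score values listed above are consistent with an unweighted mean of the six benchmark scores, rounded to one decimal place. The page does not state how the score is computed, so treat this as an assumption checked against the listed numbers, not as the leaderboard's documented formula. The sketch below recomputes the mean for each method; the names `scores`, `reported_general`, and `general_score` are illustrative only.

```python
# Scores copied from the Evaluation Results listing, in this order:
# IFEval, AIME25, GSM8K, GPQA, HumanEval, LCB Suite.
scores = {
    "Original": [90.4, 56.7, 89.3, 47.0, 93.3, 48.6],
    "REAM":     [89.9, 60.0, 86.3, 38.4, 93.3, 51.0],
    "REAP":     [89.6, 50.0, 87.9, 39.4, 94.5, 50.3],
    "HC-SMoE":  [88.2, 60.0, 84.7, 34.3, 91.5, 45.9],
    "Freq":     [87.8, 60.0, 82.9, 36.9, 93.9, 44.0],
}

# General Performance Score as shown on the page.
reported_general = {
    "Original": 70.9,
    "REAM": 69.8,
    "REAP": 68.6,
    "HC-SMoE": 67.4,
    "Freq": 67.6,
}


def general_score(values):
    """Assumed formula: unweighted mean, rounded to one decimal place."""
    return round(sum(values) / len(values), 1)


if __name__ == "__main__":
    for method, values in scores.items():
        computed = general_score(values)
        # Compare with a small tolerance to avoid floating-point surprises.
        matches = abs(computed - reported_general[method]) < 0.05
        print(f"{method}: computed={computed}, "
              f"reported={reported_general[method]}, match={matches}")
```

Under this assumption, a method that drops sharply on one benchmark (for example GPQA) can still land close to the others overall if it gains on AIME25 or LCB Suite, so the per-benchmark columns are worth reading alongside the aggregate.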