Large Model Performance Prediction on 285 models on one Math benchmark
[Chart: Top-10 Recall by date (Feb 12, 2026); series: Brute-force Evaluation]
Updated 4d ago
Evaluation Results
Method                   Evaluation budget   Date      Top-10 Recall
Brute-force Evaluation   100%                2026.02   100
STAR-guided Selection    3.5%                2026.02   82
Random Selection         75%                 2026.02   70
Random Selection         50%                 2026.02   52
Random Selection         25%                 2026.02   24
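
For readers unfamiliar with the metric, the sketch below shows one common way to compute Top-k Recall: the percentage of the true top-k models (ranked by full evaluation) that also appear in the top-k chosen by a cheaper selection method. This definition is an assumption, since the page does not state how the leaderboard computes its scores. The function name `top_k_recall` and the toy scores are hypothetical.

```python
def top_k_recall(true_scores, predicted_scores, k=10):
    """Percentage of the true top-k models that also appear in the predicted top-k.

    Both arguments map a model name to a score (higher is better).
    Assumed definition; the leaderboard's exact formula is not given in the source.
    """
    if not true_scores or k <= 0:
        raise ValueError("need at least one model and k > 0")

    def top_k(scores):
        # Rank model names by score, best first, and keep the first k.
        ranked = sorted(scores, key=scores.get, reverse=True)
        return set(ranked[:k])

    true_top = top_k(true_scores)
    pred_top = top_k(predicted_scores)
    return 100.0 * len(true_top & pred_top) / len(true_top)


# Toy data (hypothetical): five models, k=2.
true_scores = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.3, "e": 0.1}
predicted_scores = {"a": 0.7, "b": 0.6, "c": 0.75, "d": 0.2, "e": 0.1}

recall = top_k_recall(true_scores, predicted_scores, k=2)
print(recall)
```

Under this reading, brute-force evaluation scores 100 by construction, because its ranking is the reference ranking.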