
Large Model Performance Prediction on 285 models on one Math benchmark

Top-10 Recall: 100 (Brute-force Evaluation)

Updated 4mo ago

Evaluation Results

Method	Links	Top-10 Recall
Brute-force Evaluation 2026.02		100
STAR-guided Selection 2026.02		82
Random Selection 2026.02		70
Random Selection 2026.02		52
Random Selection 2026.02		24