Agent Success Prediction on SWE-bench Pro (held-out)
Top result: Oracle, AUC-ROC 0.909
[Chart: AUC-ROC (y-axis) by date; Apr 1, 2026]
Updated 17d ago
Evaluation Results
| Method         | Tags                     | Date    | AUC-ROC |
|----------------|--------------------------|---------|---------|
| Oracle         | description=standard I... | 2026.04 | 0.909   |
| LLM-as-a-judge | features=LLM-as-a-judge  | 2026.04 | 0.696   |
| Combined       | features=Combined (Emb... | 2026.04 | 0.677   |
| Embedding      | features=Embedding       | 2026.04 | 0.668   |
| Baseline       | description=predicts t... | 2026.04 | 0.571   |
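
The page does not say how its AUC-ROC values were computed. As a reading aid, the sketch below uses the standard definition of the metric: the probability that a randomly chosen successful run gets a higher predicted score than a randomly chosen failed run, with ties counted as half. The function name `auc_roc` and the toy labels and scores are illustrative assumptions, not data from this benchmark.

```python
def auc_roc(labels, scores):
    """AUC-ROC via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 (1 = the agent run succeeded)
    scores: predicted success scores, higher means more likely to succeed
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the leaderboard.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = auc_roc(toy_labels, toy_scores)  # 7 of 9 pairs are ranked correctly
```

Under this definition, a predictor that gives every run the same score lands at 0.5 and a perfect ranking reaches 1.0, which gives a scale for reading the values in the table.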