Agent success prediction on GSO (held-out)
[Figure: AUC-ROC chart. Top result: Oracle, 91.1 AUC-ROC (Apr 1, 2026). Updated 17d ago.]
Evaluation Results
Method          Details                    Date     AUC-ROC
Oracle          description=standard I...  2026.04  91.1
LLM-as-a-judge  features=LLM-as-a-judge    2026.04  73.5
Embedding       features=Embedding         2026.04  72
Combined        features=Combined (Emb...  2026.04  71.9
Baseline        description=predicts t...  2026.04  63.7