Scientific Question Answering on HLE-Verified Gold (test)
[Chart: Accuracy by date. Plotted point: ATLAS-MM, Accuracy 60, Jun 1, 2026.]
Updated 1d ago
Evaluation Results
Method                        Setting                     Date      Accuracy
ATLAS-MM                      Backbone=Claude Sonnet...   2026.06   60
Self-Refine (no early stop)   Backbone=Claude Sonnet...   2026.06   58
ATLAS                         Backbone=Claude Sonnet...   2026.06   56
Self-Refine                   Backbone=Claude Sonnet...   2026.06   53
GPT-5.2-H                     Protocol=Zero-shot          2026.06   52.48
Reward-model reranking        Backbone=Claude Sonnet...   2026.06   52
Budget Forcing                Backbone=Claude Sonnet...   2026.06   51
Opus-4.6                      Protocol=Zero-shot          2026.06   50.16
Gemini-3-Pro                  Protocol=Zero-shot          2026.06   48.93
Qwen3-Max-T                   Protocol=Zero-shot          2026.06   48.48
Opus-4.5                      Protocol=Zero-shot          2026.06   48.16
Pass@1                        Backbone=Claude Sonnet...   2026.06   48