
HLE

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Knowledge-Intensive Reasoning | HLE | Avg Score 85 | 75 |
| Math Reasoning | HLE Math-100 | Pass@1 35.84 | 68 |
| Reasoning | HLE | Accuracy (HLE Reasoning) 40.8 | 63 |
| Logical Reasoning | HLE | Accuracy 0.7226 | 62 |
| Long-horizon agentic task | HLE | Performance 60 | 41 |
| Reasoning | HLE | Score 64.7 | 39 |
| Multimodal Reasoning | HLE | Accuracy 48.8 | 33 |
| Scientific Reasoning | HLE (test) | Pass@1 49 | 25 |
| High-Level Expert Knowledge Evaluation | HLE Gold 149 | Accuracy (Bio) 80.5 | 25 |
| HLE | HLE | Accuracy 67.1 | 25 |
| Humanities Question Answering | HLE | HLE Score 13.37 | 24 |
| General Reasoning | HLE | Accuracy 38.4 | 21 |
| General and STEM reasoning | HLE | Pass@1 8.12 | 20 |
| Reasoning | HLE | Head-to-head Win % 100 | 20 |
| Scientific Reasoning | HLE | pass@1612 | 17 |
| High-Level Reasoning | HLE | Average Score 26.6 | 17 |
| Reasoning | HLE | Accuracy 50.2 | 16 |
| Mathematical reasoning | HLE math | Accuracy 23.3 | 16 |
| Deep research | HLE | Accuracy 51 | 16 |
| Long-horizon agentic tasks | HLE Our Settings | Pass@1 44.4 | 15 |
| Mathematical Reasoning | HLE decontaminated | Accuracy 8.4 | 14 |
| Reasoning | HLE OOD | Accuracy 38.6 | 14 |
| Reasoning | HLE (test) | Accuracy 26 | 14 |
| Deep Search | HLE text-only | Score 40.8 | 14 |
| Reasoning | HLE | Pass@1 18.03 | 14 |
Showing 25 of 71 rows