
HLE

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Logical Reasoning | HLE | Accuracy: 0.7226 | 46
HLE | HLE | Accuracy: 67.1 | 25
Long-horizon agentic task | HLE | Performance: 60 | 24
Humanities Question Answering | HLE | HLE Score: 13.37 | 24
Reasoning | HLE | Accuracy (HLE Reasoning): 25.3 | 23
Knowledge-Intensive Reasoning | HLE | Avg Score: 85 | 23
General Reasoning | HLE | Accuracy: 38.4 | 21
Scientific Reasoning | HLE | pass@16: 12 | 17
High-Level Reasoning | HLE | Average Score: 26.6 | 17
Mathematical reasoning | HLE math | Accuracy: 23.3 | 16
Deep research | HLE | Accuracy: 51 | 16
Long-horizon agentic tasks | HLE Our Settings | Pass@1: 44.4 | 15
Deep Search | HLE text-only | Score: 40.8 | 14
Reasoning | HLE | Pass@1: 18.03 | 14
Deep Research | HLE text-only original (test) | Pass@1: 32.9 | 13
Multi-domain knowledge reasoning | HLE 500-question ablation | Success Rate (Last): 57.3 | 12
General Deep Research Tool Use | HLE | Success Rate: 42.9 | 12
High-level Multimodal Reasoning | HLE-500 | Text Score: 29.5 | 12
Hard Reasoning and Language Evaluation | HLE | Accuracy: 36.1 | 12
Mathematical Reasoning | HLE Math-text | Pass@1: 62.8 | 12
Reasoning & General | HLE | Score: 51.8 | 11
Compositional Reasoning | HLE | Accuracy: 23.1 | 11
Long-horizon agentic tasks | HLE Full | Pass@1: 45.8 | 10
Reasoning & General | HLE Full | Score (%): 0.502 | 10
Hard LLM Reasoning | HLE | Accuracy: 15.5 | 10
Showing 25 of 51 rows