
BBEH

Benchmarks

Task Name                Dataset Name  SOTA Result          Trend
General Reasoning        BBEH          Accuracy 78.8        39
Logical Reasoning        BBEH          Accuracy 58.9        21
Reasoning                BBEH (test)   Accuracy 34.5        14
LLM Routing              BBEH (val)    Top-1 Acc 66.4       14
LLM Routing              BBEH          Top-1 Accuracy 66.4  14
Reasoning                BBEH mini     Pass@1 14.8          13
Reasoning                BBEH          Accuracy 81.2        12
Algorithmic Reasoning    BBEH Mini     Accuracy 17.8        11
Reasoning                BBEH          pass@1 15.31         11
Reasoning                BBEH          Accuracy 75.8        7
Adding Mistake           BBEH          AOC 67.2             7
Truncated CoT Answering  BBEH          AOC 0.665            7