
BBEH

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| General Reasoning | BBEH | Accuracy | 78.8 | 64 |
| Logical Reasoning | BBEH | Accuracy | 58.9 | 27 |
| Reasoning | BBEH | pass@1 | 15.7 | 23 |
| Causal Reasoning | BBEH | Accuracy (Causal Reasoning) | 55.2 | 14 |
| Reasoning | BBEH (test) | Accuracy | 34.5 | 14 |
| LLM Routing | BBEH (val) | Top-1 Acc | 66.4 | 14 |
| LLM Routing | BBEH | Top-1 Accuracy | 66.4 | 14 |
| Reasoning | BBEH mini | Pass@1 | 14.8 | 13 |
| Reasoning | BBEH | Accuracy | 81.2 | 12 |
| Algorithmic Reasoning | BBEH Mini | Accuracy | 17.8 | 11 |
| Reasoning | BBEH | Accuracy | 75.8 | 7 |
| Adding Mistake | BBEH | AOC | 67.2 | 7 |
| Truncated CoT Answering | BBEH | AOC | 0.665 | 7 |
| Logical Reasoning | BBEH mini | Accuracy | 17 | 6 |
| Web of Lies | BBEH Web of Lies | Accuracy | 90.12 | 3 |
| Dyck Languages | BBEH Dyck Languages | Accuracy | 91.25 | 3 |
| Disambiguation QA | BBEH Disambiguation QA | Accuracy | 65.41 | 3 |