
BIG-bench Hard

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| General Reasoning | Big-Bench Hard (BBH) (val) | Accuracy 43.46 | 36 |
| Reasoning | Big Bench Hard 3-shot | Accuracy 41.46 | 18 |
| Word Sorting | Big-Bench Hard Word Sorting | Success Rate 79.8 | 4 |
| Counting | Big-Bench Hard Counting | Success Rate 91.9 | 4 |
| Temporal Reasoning | BIG-bench Hard Temporal Sequences (test) | Test Accuracy 62 | 4 |
| Causal Reasoning | BIG-Bench Hard Causal Judgment (OOD) | Accuracy 60.4 | 3 |
| Spatial Reasoning | BIG-Bench Hard Navigate (OOD) | Accuracy 57.6 | 3 |