
BIG-bench Hard

Benchmarks

Task Name          | Dataset Name                             | SOTA Result        | Trend
General Reasoning  | Big-Bench Hard (BBH) (val)               | Accuracy 43.46     | 36
Reasoning          | Big Bench Hard 3-shot                    | Accuracy 41.46     | 18
Word Sorting       | Big-Bench Hard Word Sorting              | Success Rate 79.8  | 4
Counting           | Big-Bench Hard Counting                  | Success Rate 91.9  | 4
Temporal Reasoning | BIG-bench Hard Temporal Sequences (test) | Test Accuracy 62   | 4