
BB

Benchmarks

Task Name                                Dataset Name         SOTA Result                        Trend
Mathematical Reasoning (Coding Tools)    BB-Hard              Accuracy: 63.33                    25
Mathematical Reasoning (Coding Tools)    BB Easy              Accuracy: 95.12                    25
Abstention in Question Answering         BB Answer Unknown    Abstention F1: 97.9                10
Agent-task matching                      BB NonIID            Cumulative Alignment Cost: 410.04  4