
ScienceAgentBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Scientific Agent Task Completion | ScienceAgentBench | Success Rate (SR): 43.1 | 40 |
| Defect Detection | ScienceAgentBench (12 confirmed defects, 102 tasks) | Recall Average (RecA): 100 | 12 |
| Scientific Code Generation | ScienceAgentBench | SR: 25.5 | 10 |
| Scientific Code Generation | ScienceAgentBench (test) | SR: 27.5 | 8 |
| Scientific Agent Task | ScienceAgentBench (test) | Success Rate (SR): 18.6 | 6 |