
Tool benchmark

Benchmarks

Task Name: Reasoning Quality Evaluation
Dataset Name: 120-tool benchmark 500 tasks simulated
SOTA Result: Mean Score 4.43
Trend: 5