Tool benchmark

Benchmarks

Task Name	Dataset Name	SOTA Result	Trend
Reasoning Quality Evaluation	120-tool benchmark 500 tasks simulated	Mean Score: 4.43		5
