AgentBench

Benchmarks

Task Name	Dataset Name	Metric	SOTA Result	
Operating System Control	AgentBench OS	Accuracy	37.6	15
Sequential task management and state maintenance	Lifelong AgentBench	Accuracy	100	14
Tool-calling for Clinical Question Answering	AgentBench FHIR (val)	Score	76.7	8
Success Rate	AgentBench	Success Rate	34.1	8
Human Correlation	AgentBench	Pearson r	0.77	8
Benchmark-topology generalisation	AgentBench topology simulation seed=42 (n=50 per topology)	ρ(B)-reduction	1.39	7
Web Shopping	AgentBench Web Shopping	Task Completion Score (TCS)	36.2	4
Web Browsing/Site navigation	AgentBench WS	Task Completion Score (TCS)	52	4
Long-term Planning	AgentBench LTP	Task Completion Score (TCS)	32.3	4
Digital Card Game strategy	AgentBench DCG	Task Completion Score (TCS)	73.6	4
Knowledge Graph navigation	AgentBench KG	Task Completion Score (TCS)	72.7	4
Database interaction	AgentBench DB	Task Completion Score (TCS)	74.3	4
Web/tool tasks	AgentBench 700 (test)	Success Rate	79.7	3
Knowledge-graph traversal	AgentBench FB15k-237	Success Rate (SR)	83.3	2
Deliberative-reasoning verification	AgentBench DB-Bench 20 held-out tasks	CheckBeforeSubmit (CBS)	16.7	1
