
AgentBench

Benchmarks

Task Name                                        | Dataset Name            | SOTA Result                      | Trend
Operating System Control                         | AgentBench OS           | Accuracy 37.6                    | 15
Tool-calling for Clinical Question Answering     | AgentBench FHIR (val)   | Score 76.7                       | 8
Success Rate                                     | AgentBench              | Success Rate 34.1                | 8
Human Correlation                                | AgentBench              | Pearson r 0.77                   | 8
Sequential task management and state maintenance | Lifelong AgentBench     | Accuracy 100                     | 5
Web Shopping                                     | AgentBench Web Shopping | Task Completion Score (TCS) 36.2 | 4
Web Browsing/Site navigation                     | AgentBench WS           | Task Completion Score (TCS) 52   | 4
Long-term Planning                               | AgentBench LTP          | Task Completion Score (TCS) 32.3 | 4
Digital Card Game strategy                       | AgentBench DCG          | Task Completion Score (TCS) 73.6 | 4
Knowledge Graph navigation                       | AgentBench KG           | Task Completion Score (TCS) 72.7 | 4
Database interaction                             | AgentBench DB           | Task Completion Score (TCS) 74.3 | 4