
AgentBench

Benchmarks

Task Name                 Dataset Name   SOTA Result          Trend
Operating System Control  AgentBench OS  Accuracy: 34.7       12
Success Rate              AgentBench     Success Rate: 34.1   8
Human Correlation         AgentBench     Pearson r: 0.77      8