
AgentEvalBench

Benchmarks

Task Name                   | Dataset Name              | SOTA Result | Trend
Automating Agent Evaluation | AgentEvalBench            | Eval@1: 65  | 10
Meta-evaluation             | AgentEvalBench 1.0 (test) | URF: 85     | 8
Meta-evaluation             | AgentEvalBench            | URF: 83.8   | 8