Automating Agent Evaluation on AgentEvalBench
[Chart: Eval@1 over time. Highlighted point: EvalAgent, Eval@1 = 65, May 12, 2026.]
Updated 21d ago
Evaluation Results
Method                  LLM          Date      Eval@1
EvalAgent               Sonnet 4.5   2026.05   65
EvalAgent               Haiku 4.5    2026.05   62.5
Agent-Sourcecode (B2)   Sonnet 4.5   2026.05   60
Agent-Sourcecode (B2)   Haiku 4.5    2026.05   45
Agent-Onestage (B3)     Sonnet 4.5   2026.05   35
Agent-Twostage (B4)     Haiku 4.5    2026.05   32.5
Agent-Twostage (B4)     Sonnet 4.5   2026.05   30
Agent-Onestage (B3)     Haiku 4.5    2026.05   17.5
LLM-Singleturn (B1)     Sonnet 4.5   2026.05   17.5
LLM-Singleturn (B1)     Haiku 4.5    2026.05   15