Agent Evaluation Dataset (20 agents x 2 requirement types)
Figure: results chart (metrics: Time (min), Tokens (K); May 12, 2026). Best time: 0.68 min, LLM-Singleturn.
Updated 21d ago
Evaluation Results
Method            LLM         Date     Time (min)  Tokens (K)
LLM-Singleturn    Haiku 4.5   2026.05  0.68        50.26
LLM-Singleturn    Sonnet 4.5  2026.05  1.36        49.99
EvalAgent         Haiku 4.5   2026.05  3.98        2,872.07
Agent-Sourcecode  Sonnet 4.5  2026.05  4.03        869.36
EvalAgent         Sonnet 4.5  2026.05  4.18        2,094.98
Agent-Onestage    Sonnet 4.5  2026.05  4.39        1,698.44
Agent-Sourcecode  Haiku 4.5   2026.05  4.54        2,196.16
Agent-Onestage    Haiku 4.5   2026.05  5.3         3,579.46
Agent-Twostage    Haiku 4.5   2026.05  6.72        4,049.63
Agent-Twostage    Sonnet 4.5  2026.05  10          3,023.59
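Each method in the results appears twice, once per LLM backend, so the table can also be read per method rather than per row. The following is a minimal sketch that pairs each method's Haiku 4.5 and Sonnet 4.5 runs and computes the ratio of their token usage. The row transcription `ROWS` and the helper `token_ratio_by_method` are hypothetical illustrations, not part of the benchmark.

```python
from collections import defaultdict

# Hypothetical transcription of the Evaluation Results rows:
# (method, LLM, time in minutes, tokens in thousands)
ROWS = [
    ("LLM-Singleturn", "Haiku 4.5", 0.68, 50.26),
    ("LLM-Singleturn", "Sonnet 4.5", 1.36, 49.99),
    ("EvalAgent", "Haiku 4.5", 3.98, 2872.07),
    ("Agent-Sourcecode", "Sonnet 4.5", 4.03, 869.36),
    ("EvalAgent", "Sonnet 4.5", 4.18, 2094.98),
    ("Agent-Onestage", "Sonnet 4.5", 4.39, 1698.44),
    ("Agent-Sourcecode", "Haiku 4.5", 4.54, 2196.16),
    ("Agent-Onestage", "Haiku 4.5", 5.3, 3579.46),
    ("Agent-Twostage", "Haiku 4.5", 6.72, 4049.63),
    ("Agent-Twostage", "Sonnet 4.5", 10.0, 3023.59),
]


def token_ratio_by_method(rows, numerator="Haiku 4.5", denominator="Sonnet 4.5"):
    """Hypothetical helper: tokens(numerator) / tokens(denominator) per method.

    Methods that were not run with both LLMs are skipped.
    """
    tokens = defaultdict(dict)  # method -> {LLM: tokens (K)}
    for method, llm, _time_min, tokens_k in rows:
        tokens[method][llm] = tokens_k
    return {
        method: by_llm[numerator] / by_llm[denominator]
        for method, by_llm in tokens.items()
        if numerator in by_llm and denominator in by_llm
    }


if __name__ == "__main__":
    ratios = token_ratio_by_method(ROWS)
    # Print methods from the smallest to the largest Haiku/Sonnet token ratio.
    for method, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{method:18s} Haiku/Sonnet tokens = {ratio:.2f}")
```

With the values in the table, every method's Haiku 4.5 run uses more tokens than its Sonnet 4.5 run. The gap is largest for Agent-Sourcecode (2,196.16K vs. 869.36K) and smallest for LLM-Singleturn (50.26K vs. 49.99K).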