SOTA LLM Agent Evaluation benchmarks and papers with code | Wizwand

LLM Agent Evaluation

Benchmarks

Dataset Name	SOTA Method	Metric	Trend
tau-bench Retail	EVOTOOL	Pass@1 64.8		38	1mo ago
tau-bench Airline	NoisyAgent	Pass@4 78		29	7d ago
