| Dataset Name | SOTA Method | Metric | Value | Trend | |
|---|---|---|---|---|---|
| τ2-Bench (Tau-bench) Retail and Telecom | | Overall Success Rate | 85.79 | 17 | 4d ago |
| Tau-Bench | Qwen3-235B-Instruct-2507 | Retail Score | 71.3 | 13 | 4d ago |
| ACEBench (agent-task) | | Multi Turn Success Rate | 97.5 | 13 | 4d ago |
| tau^2 Bench official evaluation setting GPT-4.1 simulator | REACT(GPT-5) | Retail Score | 0.775 | 9 | 4d ago |
| tau2-bench Airline | CODEDELEGATOR | Pass@1 | 63.5 | 6 | 4d ago |
| tau2-bench Retail | CODEDELEGATOR | Pass@1 | 82 | 6 | 4d ago |
| SpreadSheetBench | | Success Rate | 55.36 | 5 | 4d ago |
| Tau2-Telecom | LongCat-Flash-Lite | Avg@8 | 72.8 | 4 | 4d ago |
| Tau2 Retail | LongCat-Flash-Lite | Avg@8 | 73.1 | 4 | 4d ago |
| Tau2-Airline | LongCat-Flash-Lite | Avg@8 | 58 | 4 | 4d ago |