| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| tau2-bench Airline | CODEDELEGATOR | Pass@1 | 63.5 | 22 | 1mo ago |
| tau2-bench Retail | CODEDELEGATOR | Pass@1 | 82 | 22 | 22d ago |
| AppWorld (Challenge) | RCL (all primitives) | TGC | 83.7 | 20 | 13d ago |
| AppWorld Normal | RCL (all primitives) | Task Goal Completion (TGC) | 89.3 | 20 | 13d ago |
| τ2-Bench (Tau-bench) Retail and Telecom | | Overall Success Rate | 85.79 | 17 | 1mo ago |
| Tau-Bench | Qwen3-235B-Instruct-2507 | Retail Score | 71.3 | 13 | 1mo ago |
| ACEBench (agent-task) | | Multi Turn Success Rate | 97.5 | 13 | 1mo ago |
| tau^2 Bench official evaluation setting GPT-4.1 simulator | REACT(GPT-5) | Retail Score | 0.775 | 9 | 1mo ago |
| Tau2-Telecom | LongCat-Flash-Lite | Avg@8 | 72.8 | 8 | 19d ago |
| Tau2 Retail | LongCat-Next | Avg@8 | 73.68 | 8 | 19d ago |
| Tau2-Airline | LongCat-Flash-Lite | Avg@8 | 58 | 8 | 19d ago |
| τ²-Bench Telecom | | Accuracy | 99.3 | 5 | 6d ago |
| τ²-Bench Airline | | Accuracy | 67.5 | 5 | 6d ago |
| τ²-Bench Retail | | Accuracy | 84.7 | 5 | 6d ago |
| SpreadSheetBench | | Success Rate | 55.36 | 5 | 1mo ago |
| tau2-Bench | PivotRL | Accuracy | 64 | 2 | 25d ago |