| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| MiniWob++ (held-in) | ENVISIONS | Performance (%) | 87.12 | 14 | 4d ago |
| Terminal-Bench | DeepSeek V3.2 | Accuracy | 45 | 8 | 4d ago |
| τ2-Bench | LongCat-Flash Exp-Chat | Accuracy | 69.5 | 4 | 4d ago |
| SWE-bench Verified | DeepSeek V3.2 | Accuracy | 72.1 | 4 | 4d ago |
| Terminal Bench Hard English | HyperCLOVA X 32B Think | Score | 9.9 | 3 | 4d ago |
| Terminal Bench English 1.0 | HyperCLOVA X 32B Think | Score | 21.8 | 3 | 4d ago |
| Tau2 Telecom English | HyperCLOVA X 32B Think | Score | 65.1 | 3 | 4d ago |
| Tau2 Retail (English) | HyperCLOVA X 32B Think | Score | 71.6 | 3 | 4d ago |
| Tau2 Airline (English) | HyperCLOVA X 32B Think | Score | 58 | 3 | 4d ago |
| BFCL MULTI_TURN_LIVE | CompassMax-V3-Thinking | Accuracy | 0.195 | 3 | 4d ago |
| BFCL AST LIVE | DeepSeek-R1 | Accuracy | 80.61 | 3 | 4d ago |
| BFCL AST NON LIVE | DeepSeek-R1 | Accuracy | 87.36 | 3 | 4d ago |