| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| τ2-Bench | | Accuracy | 85.4 | 41 | 8d ago |
| MiniWob++ (held-in) | ENVISIONS | Performance (%) | 87.12 | 14 | 3mo ago |
| Terminal-Bench | DeepSeek V3.2 | Accuracy | 45 | 12 | 6d ago |
| GAIA Text-Only | | Score | 84.5 | 7 | 8d ago |
| SWE-bench Verified | DeepSeek V3.2 | Accuracy | 72.1 | 4 | 3mo ago |
| Terminal Bench Hard English | HyperCLOVA X 32B Think | Score | 9.9 | 3 | 3mo ago |
| Terminal Bench English 1.0 | HyperCLOVA X 32B Think | Score | 21.8 | 3 | 3mo ago |
| Tau2 Telecom English | HyperCLOVA X 32B Think | Score | 65.1 | 3 | 3mo ago |
| Tau2 Retail (English) | HyperCLOVA X 32B Think | Score | 71.6 | 3 | 3mo ago |
| Tau2 Airline (English) | HyperCLOVA X 32B Think | Score | 58 | 3 | 3mo ago |
| BFCL MULTI_TURN_LIVE | CompassMax-V3-Thinking | Accuracy | 0.195 | 3 | 3mo ago |
| BFCL AST LIVE | DeepSeek-R1 | Accuracy | 80.61 | 3 | 3mo ago |
| BFCL AST NON LIVE | DeepSeek-R1 | Accuracy | 87.36 | 3 | 3mo ago |
| SEAL | | Score | 57.4 | 2 | 8d ago |
| BFCL | JT-Safe-V2-35B | Score | 73.69 | 2 | 8d ago |