| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Tool Use | BEHEMOTH ToolBench (out-of-distribution) | Success Rate | 26.82 | 6 |
| Graduate-level Reasoning | BEHEMOTH GPQA Diamond (out-of-distribution) | Accuracy | 50 | 6 |
| Long-context Memory Evaluation | BEHEMOTH LongMemEval (out-of-distribution) | Accuracy | 63.07 | 6 |
| Memory Extraction | BEHEMOTH in-distribution (test) | Personalization (MA) | 65.72 | 6 |