| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| SimpleToM | GPT-5 | Accuracy | 99.24 | 29 | 2mo ago |
| TactfulToM | DeepSeek-R1 | Accuracy | 69.69 | 26 | 2mo ago |
| Hi-ToM | SocialR1-8B | Accuracy | 70.83 | 26 | 2mo ago |
| MotiveBench | | Accuracy | 94 | 26 | 2mo ago |
| EmoBench | | Accuracy | 80.39 | 26 | 2mo ago |
| ToMBench Hard (val) | SocialR1-8B | Accuracy | 62.79 | 26 | 2mo ago |
| ToMBench | | Accuracy | 78.34 | 26 | 2mo ago |
| GRASP-Bench (test) | | T1 Accuracy | 42.9 | 18 | 16d ago |
| Sotopia hard | | Rel Score | 2.4 | 17 | 1mo ago |
| MotiveBench OOD (test) | GPT-4o | Amazon Score | 0.9011 | 17 | 3mo ago |
| TVQA+ | Qwen3-VL-8B + SGR | Accuracy | 73.2 | 15 | 16d ago |
| Online-MMSI | Qwen3.5-9B (Thinking Mode) | STI | 63.1 | 15 | 16d ago |
| MMSI | Qwen3.5-9B + SGR | STI | 71.2 | 15 | 16d ago |
| Sotopia (all) | | Rel Score | 2.73 | 15 | 1mo ago |
| SIQA | Autoregressive | Performance (%) | 15.2 | 6 | 3mo ago |
| When2Call | AutoAdapt | Accuracy | 54.5 | 5 | 2mo ago |