| Dataset Name | SOTA Method | Metric | Trend | ||
|---|---|---|---|---|---|
| DialogSum | Aligner | Reasoning 99.1 | 33 | 3mo ago | |
| DeepfakeJudge Reason 1.0 (test) | | BLEU-1 9 | 16 | 3mo ago | |
| MR-Ben | Qwen2.5-Math-PRM-7B-PDDL-r | Math F1 56.1 | 10 | 1mo ago | |
| ParaRev | Full DRO | WR vs. Base 63.7 | 8 | 23d ago | |
| 109-sample (test) | Universe Routing | Accuracy 97.25 | 7 | 2mo ago | |
| Limit Texas Hold’em | | Hit Rate 2 | 6 | 3mo ago | |
| Leduc Hold’em | | Hit Rate (HR) 2 | 6 | 3mo ago | |
| Full task set (n=45) | | Overall Score 8.84 | 5 | 2mo ago | |