| Dataset Name | SOTA Method | Metric | Trend | Updated |
|---|---|---|---|---|
| DecisionBench | Debate-AHP | CR Mean 0.2902 | 9 | 3mo ago |
| Diagnostic (Avg. YouCook2, COIN, CrossTask) (test) | CAST | State Accuracy 76.92 | 8 | 2mo ago |
| RealUnify GEU | - | MC Score 0.32 | 4 | 15d ago |
| Unified-Bench | Bagel + Unicot | CLIP Score 90.26 | 4 | 15d ago |
| Consistency Evaluation Dataset (N=720) 1.0 (test) | - | - | 0 | 3mo ago |