| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Utterance Classification | Phase 2 debugging Overall 13 turns | F1 Score: 91 | 12 |
| Relational inference | PHASE | Graph Accuracy: 79.21 | 10 |
| Trajectory Prediction | PHASE (test) | ADE: 0.801 | 10 |
| Profile Classification | Phase 10x10 grid 3 | Profile Accuracy: 48.9 | 7 |
| Criterion Validity Analysis | Phase 2 | Spearman's rho: 0.351 | 6 |
| Criterion Validity Analysis | Phase 1 | Spearman's rho: 0.607 | 6 |
| Direct Verifier Evaluation | Phase 2 (test) | Actual Accuracy: 15 | 4 |
| Trial outcome prediction | Phase 2 | Log Loss: 0.629 | 3 |
| Trial outcome prediction | Phase 1 (trials) | Log Loss: 0.565 | 3 |