| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Conversation Summarization | Stack | QAGS | 57.75 | 25 |
| Opinion Diversity Coverage | Stack | Coverage | 75 | 15 |
| Stack push-pop state tracking | Stack | Accuracy | 99.98 | 12 |
| Abstractive Summarization | Stack ConvoSumm 1.0 (test) | ROUGE-1 | 39.73 | 11 |
| Object Stacking | Stack Composition C (test) | Success Rate | 93.7 | 10 |
| Object Stacking | Stack Spuriousness S (test) | Success Rate | 97.6 | 10 |
| Object Stacking | Stack In-distribution I (test) | Success Rate | 97.2 | 10 |
| Robotic Manipulation | Stack Shifted Environment (test) | Testing Reward | 0.77 | 8 |
| Dynamic Link Prediction | Stack ubuntu (inductive) | AUC-ROC | 83.29 | 7 |
| Dynamic Link Prediction | Stack elec (inductive) | AUC-ROC | 86.07 | 7 |
| Dynamic Link Prediction | Stack ubuntu (transductive) | AUC-ROC | 96.49 | 7 |
| Dynamic Link Prediction | Stack elec (transductive) | AUC-ROC | 97.98 | 7 |
| Classification | Stack Social axes V2 (test) | Group A Accuracy | 70.5 | 5 |
| Task Planning | Stack 1.0 (test) | Average Planning Time Cost (s) | 5.94 | 3 |
| Class Invariant Synthesis | stack | Total Invariants | 6 | 1 |