| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Tool-use Reasoning | Ours (test) | Solve Precision (P) | 52.78 | 27 |
| Causal Discovery | Ours Noisy | AUROC | 82.3 | 9 |
| Causal Discovery | Ours Original | AUROC | 0.821 | 9 |
| Instruction Following Evaluation | Ours hard seed data | Score | 56.73 | 5 |
| Language Detoxification | Ours (test) | Overall Offensiveness Score | 1.145 | 5 |
| Makeup Transfer | Ours (test) | FID | 11.67 | 4 |
| Fine-grained Score Accuracy | Ours | Exact Accuracy | 70.56 | 1 |