| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| Just-Eval | | Just-Eval Average Score | 4.83 | 50 | 22d ago |
| MMLU | | MMLU Score | 49.5 | 45 | 19d ago |
| AgentDojo | GLM-4.5 | Utility | 78.4 | 32 | 1mo ago |
| SLIMORCA (test) | TOSS-Pro | Score | 68.85 | 24 | 3mo ago |
| MATH500 | RealSafe-R1 | Pass@1 Accuracy | 93.6 | 22 | 27d ago |
| GPQA Diamond | | Accuracy (pass@1) | 53 | 22 | 27d ago |
| MMLU | SafeDecoding | Accuracy (pass@1) | 78 | 22 | 27d ago |
| NQ-Open | CNT | Delta NQ-Open | 5.13 | 17 | 2mo ago |
| MMLU | CNT | ΔMMLU | 0.2 | 17 | 2mo ago |
| MMLU, GSM8K | PALETTE | MMLU Accuracy | 70.4 | 16 | 8d ago |
| Anchor Utility Dataset | CDA | Anchor-PPL | 5.24 | 16 | 2mo ago |
| GM | TVAE | Balanced Acc | 66.6 | 13 | 3mo ago |
| CR | | Balanced Acc | 68.6 | 13 | 3mo ago |
| CC | DP-CTGAN | Balanced Acc | 67.3 | 13 | 3mo ago |
| BM | TVAE | Balanced Acc | 60.3 | 13 | 3mo ago |
| AD | | Balanced Accuracy | 81.8 | 13 | 3mo ago |
| ScienceQA (S-QA) | CMRM_dataset | Accuracy | 73.2 | 13 | 3mo ago |
| LLaVA-Bench Coco | ShareGPT4V | Score | 92.3 | 13 | 3mo ago |
| Downstream Tasks | DAPT (nontoxic) | Average Accuracy | 63.4 | 12 | 3mo ago |
| BC | | Balanced Acc | 72.1 | 11 | 3mo ago |
| LM Utility Evaluation Dataset | CB | Utility Score | 9.12 | 8 | 1mo ago |
| MMbench and DocVQA (test) | | MMbench Score | 87.02 | 7 | 3mo ago |
| XSTest Safe Prompts | FedDPO | Compliance | 97.2 | 3 | 1mo ago |
| IQ Dataset | - | - | - | 0 | 2mo ago |