| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| TL;DR (test) | CW-rDPO | Win Rate | 68.8 | 36 | 1mo ago |
| HH-RLHF (test) | CW-IPO | Win Rate | 87.4 | 36 | 1mo ago |
| HH-RLHF | SPPO-HPS | BLEU | 0.275 | 31 | 26d ago |
| UF-P-4 | SPL | Accuracy (%) | 62.46 | 20 | 1mo ago |
| UF-P 2 | SPL | Accuracy | 63.71 | 20 | 1mo ago |
| PRISM | CUMA | Win-Rate (DPO) | 74.5 | 20 | 1mo ago |
| UFB | CW-DPO | Win Rate | 83.2 | 18 | 1mo ago |
| UFB (test) | CW-DPO | Win Rate | 81.05 | 18 | 1mo ago |
| Psoups (test) | MetaAligner | Helpfulness (RM) | 1.39 | 13 | 1mo ago |
| Anthropic-hh-rlhf (test) | PLC | LLM-as-a-Judge Helpful Score | 5.83 | 12 | 10d ago |
| AlpacaEval | AdaBoN | Win Rate | 52 | 12 | 1mo ago |
| Ultrafeedback 40% flipping ratio | FA-DPO | Accuracy | 78.87 | 12 | 1mo ago |
| Ultrafeedback 20% flipping ratio | FA-DPO | Accuracy | 78.8 | 12 | 1mo ago |
| UltraFeedback (test) | FedPDPO | Accuracy | 74.18 | 11 | 26d ago |
| PyDPO (test) | FedPDPO | Accuracy | 94.32 | 11 | 26d ago |
| WebGPT (test) | FedPDPO | Accuracy | 61.24 | 11 | 26d ago |
| AlpacaEval weighted gpt4 turbo 2.0 | GANPO (SimPO) | Win Rate | 46.11 | 8 | 1mo ago |
| Board Game Playtesting Dataset | MeepleLM | MAE | 0.6576 | 8 | 1mo ago |
| CSQA | Pep | Preference Alignment | 78.2 | 5 | 1mo ago |
| SocialIQA | Pep | Preference Alignment | 87.3 | 5 | 1mo ago |
| AIME | Pep | Preference Alignment | 80.1 | 5 | 1mo ago |
| MedQA | Pep | Preference Alignment | 77.4 | 5 | 1mo ago |
| Argilla-7k (test) | MixDPO | LC Win Rate | 9.23 | 5 | 1mo ago |
| PRISM 1.0 (test) | Hard Panel | Borda Average | 2.393 | 5 | 1mo ago |
| PRISM normalized-step (test) | Hard Panel | Borda Avg | 2.328 | 5 | 1mo ago |