| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| ESConv | KEMI | Win Rate | 70 | 28 | 4d ago |
| CulturalVQA OOD (test) | MMBoundary | Faithfulness | 7.66 | 6 | 4d ago |
| ScienceVQA (test) | MMBoundary | Faithfulness Score | 8.35 | 6 | 4d ago |
| A-OKVQA (test) | MMBoundary | Faithfulness Score | 7.83 | 6 | 4d ago |
| UltraFeedback 50 sampled questions | OTPO | Win Rate (Expert 1) | 62 | 5 | 4d ago |
| Human Evaluation | Ann Brown | Trustworthiness | 0.86 | 4 | 4d ago |
| MathQA | Ours | Accuracy | 89.2 | 3 | 4d ago |
| 50 randomly selected model responses | GPT-4.1 | Clarity | 98 | 3 | 4d ago |
| Human Evaluation Set (test) | LongDPO | Win Rate | 0.65 | 3 | 4d ago |
| 200 human-generated instructions | Olympus | Success Rate | 0.865 | 3 | 4d ago |
| HH dataset | RRHF_DP | Win Rate | 59 | 3 | 4d ago |
| MS MARCO (test) | RBG | Preference: FiD | 18 | 3 | 4d ago |
| K/DA and K-OMG (50 random samples) | | Overall Offensiveness Score | 3.24 | 2 | 4d ago |
| LongBench Chat | LongReward + DPO | Helpfulness Win Rate | 14 | 1 | 4d ago |
| WMT English-Czech 2019 | binmt | Preference: Much Better | 0.5 | 1 | 4d ago |
| Tools 100 pairs | DiffLM | Win Rate | 88 | 1 | 4d ago |