| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Language Model Evaluation | ArxivRollBench 2026a | Valid Accuracy: 70.8 | 42 |
| Research Paper Reasoning and Comprehension | ArxivRollBench 2025a (val) | Valid Accuracy: 44.3 | 38 |
| Biased overtraining evaluation | ArxivRollBench | RS_II: 0.22 | 14 |
| Contamination Evaluation | ArxivRollBench | Absolute RS_I: 0.48 | 14 |