| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Software Engineering | PaperBench Code (dev) | Score 78.2 | 15 |
| Repository-level paper reproduction | PaperBench Code lite (dev) | MU-DPO 49.26 | 12 |
| Experiment Reproduction | PaperBench Code (dev) | Score 64.1 | 9 |
| Software Engineering | PaperBench | Score 66.8 | 9 |
| Paper-to-Code Reproduction | PaperBench Code (dev) | Final Score 78.6 | 9 |
| Long-horizon Research Task Reproduction | PaperBench Code (dev) | FRE Score 72.22 | 7 |
| ML research engineering | PaperBench | Adaptive Pruning Score 33.26 | 6 |
| Paper-to-code reproduction | PaperBench Code ICML 2024 (dev) | Average Score 0.786 | 6 |
| Detail Knowledge Extraction | PaperBench Category B: Detail | Accuracy 92.6 | 2 |
| Fidelity Knowledge Extraction | PaperBench Category A: Fidelity | Accuracy 96.7 | 2 |