| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Software Engineering | PaperBench | Score: 66.8 | 9 |
| Paper-to-Code Reproduction | PaperBench Code (dev) | Final Score: 78.6 | 9 |
| Long-horizon Research Task Reproduction | PaperBench Code (dev) | FRE Score: 72.22 | 7 |
| ML research engineering | PaperBench | Adaptive Pruning Score: 33.26 | 6 |
| Paper-to-code reproduction | PaperBench Code ICML 2024 (dev) | Average Score: 0.786 | 6 |