
PaperBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Software Engineering | PaperBench | Score: 66.8 | 9 |
| Paper-to-Code Reproduction | PaperBench Code (dev) | Final Score: 78.6 | 9 |
| Long-horizon Research Task Reproduction | PaperBench Code (dev) | FRE Score: 72.22 | 7 |
| ML research engineering | PaperBench | Adaptive Pruning Score: 33.26 | 6 |
| Paper-to-code reproduction | PaperBench Code ICML 2024 (dev) | Average Score: 0.786 | 6 |