Long-horizon Research Task Reproduction on PaperBench Code (dev)

72.22 FRE Score

Claude 4.5 Sonnet

[Score chart; data point dated Feb 19, 2026]
Updated 1mo ago

Evaluation Results

Method | Links
2026.02
72.2286.7670.4264.2857.3192.7536.782.7463.0381.8542.0183.3579.3754.8951.8183.3868.4475.6590.8957.1469.75
2026.02
70.768.740.362.7960.5859.2935.8923.0263.294.0729.1361.0166.0615.514.9582.4234.0772.8544.7142.9347.11
2026.02
69.5558.5235.0665.6428.157.235.0571.0161.2923.2362.5752.0959.7832.121.3579.7628.9374.6583.0827.2651.31
2026.02
62.3456.3224.6415.9118.825237.3458.5479.6168.1542.2125.9238.0227.9734.6646.0755.4867.4176.3656.2347.2
2026.02
61.0475.6551.2580.0970.1480.1342.7156.9369.6156.1454.6271.4163.6355.2768.468.084782.1262.0235.6162.59
2026.02
33.2368.756.8548.9439.0773.4449.647.7751.557.2540.4375.5428.563.7148.8164.2532.969.9362.6133.1552.31
2026.02
15.8321.6728.8744.151.546.9723.9611.9834.1318.2934.3874.0419.4220.712.5651.5514.3421.6768.4420.0528.72