ArxivRollBench

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Language Model Evaluation | ArxivRollBench 2026a | Valid Accuracy 70.8 | 42
Research Paper Reasoning and Comprehension | ArxivRollBench 2025a (val) | Valid Accuracy 44.3 | 38
Biased overtraining evaluation | ArxivRollBench | RSII 0.22 | 14
Contamination Evaluation | ArxivRollBench | Absolute RS_I 0.48 | 14