Language Model Evaluation on ArxivRollBench 2026a

70.8 Valid Accuracy

moonshotai/kimi-k2.6

Jul 25, 2025
Updated 8d ago

Evaluation Results

Method  Links
2025.07
70.8   5.6    7.9
55.2   55.2   100
2025.07
54.4   54.4   100
53.4   52.9   99
2025.07
53.3   25.4   47.7
52     51.8   99.7
48.3   48.3   100
2025.07
46     12.3   26.8
2025.07
45.3   28     61.8
2025.07
43.2   14.3   33
2025.07
42.9   26.6   61.9
42.1   40     95
40.8   40.2   98.7
2025.07
39.3   39.3   100
2025.07
38.5   7.4    19.1
37.6   37.6   100
35.5   35.4   99.8
34.5   34.5   100
33.9   33.9   100
31.8   31.8   100
30     30     100
29.5   29.4   99.8
29.3   29.3   99.8
28.2   28.2   100
27.7   27.7   100
2025.07
27.4   19.7   71.8
26.2   26.2   100
25.8   25.8   100
23.8   23.8   99.9
23.7   22.7   95.6
23.4   23.4   100
23.2   23.2   100
22.1   22.1   100
21.9   21.9   100
21.6   21.2   98.2
21.2   20.3   95.6
12.6   12.6   100
2025.07
8.3    4.8    58.5
7.5    7.5    100
2025.07
0      0      0
0      0      100
0      0      100