
Policy Evaluation on PolicyBench Overall Average

66.34 Accuracy

Deepseek R1

[Chart: Accuracy, Apr 14, 2026]
Updated 4d ago

Evaluation Results

Method    Links
66.34    2026.04
64.13    2026.04
63.82    2026.04
63.75    2026.04
62.97    2026.04
61.67    2026.04
60.1     2026.04
59.47    2026.04
59.17
59.1
58.21