
Policy Evaluation on PolicyBench Level 1 (US)

Accuracy: 59.33

Deepseek R1

[Chart: Accuracy, Apr 14, 2026]
Updated 4d ago

Evaluation Results

Accuracy    Date       Method    Links
59.33       2026.04
58.76       2026.04
58.68       2026.04
57.73       2026.04
54.9        2026.04
53.71       2026.04
52.69       2026.04
52.55
50.12
49.91       2026.04
46.4