Policy Evaluation on PolicyBench Level 2 (CN)

Accuracy: 62.92

Deepseek R1

[Chart: accuracy, y-axis approx. 55.21 to 61.22; Apr 14, 2026]
Updated 4d ago

Evaluation Results

Accuracy    Date
62.92       2026.04
60.57       2026.04
60.47       2026.04
59.79       2026.04
59.74       2026.04
56.56       2026.04
56.39       2026.04
56.34       2026.04
55.81
55.56
55.51