
PolicyBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Policy Evaluation | PolicyBench Overall Average | Accuracy: 66.34 | 11 |
| Policy Evaluation | PolicyBench Level 3 US | Accuracy: 77 | 11 |
| Policy Evaluation | PolicyBench Level 3 CN | Accuracy: 80.34 | 11 |
| Policy Evaluation | PolicyBench Level 2 (US) | Accuracy: 68.95 | 11 |
| Policy Evaluation | PolicyBench Level 2 (CN) | Accuracy: 62.92 | 11 |
| Policy Evaluation | PolicyBench Level 1 (US) | Accuracy: 59.33 | 11 |
| Policy Evaluation | PolicyBench Level 1 (CN) | Accuracy: 62.02 | 11 |
| Policy Question Answering | PolicyBench | Accuracy: 64.34 | 11 |
| Policy Question Answering | PolicyBench US | Accuracy: 66.43 | 11 |
| Policy Question Answering | PolicyBench Chinese | Accuracy: 65.33 | 11 |