| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Policy Evaluation | PolicyBench Overall Average | Accuracy: 66.34 | 11 |
| Policy Evaluation | PolicyBench Level 3 US | Accuracy: 77 | 11 |
| Policy Evaluation | PolicyBench Level 3 CN | Accuracy: 80.34 | 11 |
| Policy Evaluation | PolicyBench Level 2 (US) | Accuracy: 68.95 | 11 |
| Policy Evaluation | PolicyBench Level 2 (CN) | Accuracy: 62.92 | 11 |
| Policy Evaluation | PolicyBench Level 1 (US) | Accuracy: 59.33 | 11 |
| Policy Evaluation | PolicyBench Level 1 (CN) | Accuracy: 62.02 | 11 |
| Policy Question Answering | PolicyBench | Accuracy: 64.34 | 11 |
| Policy Question Answering | PolicyBench US | Accuracy: 66.43 | 11 |
| Policy Question Answering | PolicyBench Chinese | Accuracy: 65.33 | 11 |