| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Instruction following | IHEval | PLA | 86.7 | 21 |
| Task Execution | IHEval | Language Detection (Reference) | 100 | 12 |
| Tool Use | IHEval Overall Tool Use v1 (All) | Average Accuracy | 69.9 | 12 |
| Slack User | IHEval v1 (Conflict) | Accuracy | 80 | 12 |
| Slack User | IHEval v1 (Aligned) | Accuracy | 83 | 12 |
| Slack User | IHEval v1 (Reference) | Accuracy | 94 | 12 |
| Get Webpage | IHEval v1 (Conflict) | Accuracy | 39.8 | 12 |
| Get Webpage | IHEval Aligned v1 | Accuracy | 55.9 | 12 |
| Get Webpage | IHEval v1 (Reference) | Accuracy | 86 | 12 |
| Rule Following | IHEval Single-Turn | Accuracy (Reference) | 88.5 | 12 |
| Rule Following | IHEval Multi-Turn | Accuracy (Reference) | 89.8 | 12 |
| Safety Evaluation | IHEval Average 1.0 | Average Accuracy | 66.9 | 12 |
| Prompt Hijacking | IHEval Prompt Hijacking Conflict 1.0 | Accuracy | 45 | 12 |
| Prompt Hijacking | IHEval Prompt Hijacking - Alignment 1.0 | Accuracy | 82.5 | 12 |
| Prompt Hijacking | IHEval Prompt Hijacking 1.0 (Reference) | Accuracy | 97.5 | 12 |
| Prompt Extraction | IHEval Prompt Extraction - Conflict 1.0 | Accuracy | 59.6 | 12 |
| Prompt Extraction | IHEval Prompt Extraction Alignment 1.0 | Accuracy | 83.7 | 12 |
| Prompt Extraction | IHEval Prompt Extraction 1.0 (Reference) | Accuracy | 96.9 | 12 |
| Prompt Injection Detection | IHEval Tool-use | FPR | 0 | 6 |
| Prompt Injection Detection | IHEval Rule-following | FPR | 0.01 | 6 |