| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| LLM-as-a-judge evaluation | FLASK | Pearson's r: 0.589 | 36 |
| Direct Assessment | FLASK | Pearson Correlation Coefficient: 0.7203 | 12 |
| Vulnerability Detection | FLASK | TP: 5 | 7 |
| Feedback Evaluation Alignment | FLASK | Kendall's Tau: 0.405 | 6 |
| Feedback evaluation | FLASK (test) | Kendall's Tau: 0.385 | 5 |