| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Document-level phenotype concept recognition | ID-68 | Precision 94.11 | 12 |
| Detoxification | ID | TP Score 55 | 6 |
| Open-ended Dialogue | ID Average | Win Rate 72.2 | 4 |
| LLM response quality prediction | ID Claude 3.5 Haiku 20241022 (test) | RMSE 0.45 | 3 |