| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Safety evaluation | SAGE-Eval | Safety: 90 | 18 |
| LLM-as-a-Judge Robustness | Sage (Hard) | Factuality (IPI): 55.9 | 13 |
| LLM-as-a-Judge Robustness | Sage Easy | Factuality Error (IPI): 0.059 | 13 |
| Open-Ended Question Answering | SAGE Web Search | Weighted Recall (Com. Sci.): 35.1 | 12 |
| Short-Form Question Answering | SAGE Web Search | Accuracy (Com. Sci.): 63.3 | 12 |
| Multi-hop Question Answering | SAGE Small-scale (evaluation) | # Search: 4.9 | 1 |