| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering | NQ | Accuracy 70 | 123 |
| Hallucination Detection | NQ | AUC 0.8645 | 102 |
| Question Answering | NQ (test) | AUROC 83 | 90 |
| Question Answering | NQ | Absolute Execution Time Overhead (s) 0.064 | 90 |
| Question Answering | NQ | PRR 0.65 | 90 |
| Question Answering | NQ (test) | EM Accuracy 66.4 | 86 |
| Hallucination Detection | NQ (test) | AUC ROC 95.2 | 84 |
| RAG Performance Prediction | NQ-Open | QE5 Score 0.793 | 80 |
| Question Answering | NQ | ACE Score 0.496 | 70 |
| Question Answering | NQ | ASR 99.65 | 70 |
| Question Answering | NQ | EM 79 | 69 |
| Question Answering | NQ | Accuracy 87 | 63 |
| Calibration | NQ | ECE 0.046 | 55 |
| General Question Answering | NQ | Exact Match (EM) 54.8 | 52 |
| Retrieval-Augmented Generation (RAG) | NQ | Reliability Score (RS) 54.33 | 52 |
| Table Question Answering | NQ-Table | F1 Score 80.1 | 50 |
| End-to-end Open-Domain Question Answering | NQ (test) | Exact Match (EM) 54 | 50 |
| Question Answering | NQ | Exact Match 72.57 | 46 |
| Single-Hop Question Answering | NQ | Exact Match (EM) 51.7 | 44 |
| Explaining LLMs | NQ | CRR 11.76 | 42 |
| General QA | NQ | EM 40.6 | 38 |
| Question Answering | NQ | NQ Recall (%) 90.6 | 36 |
| Information Retrieval | NQ320k | Hits@1 40.4 | 32 |
| Question Answering | NQ-Open | Exact Match (EM) 47.4 | 32 |
| Question Answering | NQ | F1 Score (NQ) 78.8 | 31 |