| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Hallucination Detection | NQ | AUC 0.889 | 154 |
| Question Answering | NQ (test) | EM Accuracy 66.4 | 133 |
| Question Answering | NQ | Accuracy 70 | 123 |
| Question Answering | NQ | Accuracy 87 | 113 |
| Question Answering | NQ | Exact Match 74.2 | 101 |
| Hallucination Detection | NQ (test) | AUC ROC 95.2 | 91 |
| Question Answering | NQ (test) | AUROC 83 | 90 |
| Question Answering | NQ | Absolute Execution Time Overhead (s) 0.064 | 90 |
| Question Answering | NQ | PRR 0.65 | 90 |
| RAG Performance Prediction | NQ-Open | QE5 Score 0.793 | 80 |
| Open-Domain Question-Answering | NQ | Accuracy 61.6 | 74 |
| Question Answering | NQ | ACE Score 0.496 | 70 |
| Question Answering | NQ | ASR 99.65 | 70 |
| Question Answering | NQ | EM 79 | 69 |
| Question Answering | NQ | F1 Score (NQ) 78.8 | 64 |
| Table Question Answering | NQ-Table | F1 Score 80.1 | 63 |
| Single-Hop Question Answering | NQ | Exact Match (EM) 51.7 | 60 |
| End-to-end Open-Domain Question Answering | NQ (test) | Exact Match (EM) 55.1 | 59 |
| Question Answering | NQ | F1 Score 44.11 | 56 |
| Calibration | NQ | ECE 0.046 | 55 |
| General QA | NQ | EM 46.9 | 54 |
| Information Retrieval | NQ320k | Hits@1 48.92 | 54 |
| General Question Answering | NQ | Exact Match (EM) 54.8 | 52 |
| Retrieval-Augmented Generation (RAG) | NQ | Reliability Score (RS) 54.33 | 52 |
| Retrieval-Augmented Question Answering | NQ | Clean Accuracy 89 | 45 |