| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Open Question Answering | Natural Questions (NQ) (test) | Exact Match (EM): 58.4 | 134 |
| Retrieval Attack Defense | Natural Questions (NQ) | ASR: 0 | 99 |
| Inference Efficiency | Natural Questions (NQ) | Relative Overhead (%): 0.019 | 90 |
| Open Domain Question Answering | Natural Questions (NQ) | Exact Match (EM): 60.7 | 74 |
| Over-refusal Evaluation | NQ (Natural Questions) | ORR: 0 | 72 |
| Question Answering | Natural Questions (test) | EM: 61.65 | 72 |
| Question Answering | NQ (Natural Questions) | EM: 78.3 | 70 |
| Question Answering | Natural Questions (NQ) (test) | Exact Match: 76 | 68 |
| Retrieval | Natural Questions (test) | Top-5 Recall: 92.1 | 62 |
| Question Answering | NQ (Natural Questions) (test) | Accuracy: 68.6 | 60 |
| Question Answering | Natural Questions | EM: 70.58 | 52 |
| Question Answering | Natural Questions (NQ) | Accuracy: 49.3 | 48 |
| Question Answering | Natural Questions (NQ) (test) | Robust Accuracy: 68 | 45 |
| Knowledge Evaluation | Natural Questions (NQ) (Evaluation) | Accuracy: 83 | 45 |
| Passage retrieval | Natural Questions (NQ) (test) | Top-20 Accuracy: 85.2 | 45 |
| Embedding Alignment | Natural Questions (test) | Top-1 Accuracy: 100 | 40 |
| Open-QA Evaluation | EVOUNA-NaturalQuestions | F1 Score: 97.9 | 35 |
| Honesty Alignment | Natural Questions (NQ) In-Domain | AUROC: 85.16 | 33 |
| Single-hop Question Answering | Natural Questions (NQ) (test) | EM: 47.5 | 33 |
| Open-Domain Question Answering | NQ (Natural Questions) | EM: 51.4 | 33 |
| Question Answering | NQ (Natural Questions) | EM: 42.5 | 28 |
| Passage Retrieval | Natural Questions (NQ) | Top-10 Accuracy: 66.59 | 28 |
| Closed-book Question Answering | Natural Questions (test) | Accuracy: 29.9 | 27 |
| Question Answering | Natural Questions (test) | Speedup Ratio: 2.916 | 26 |
| Information Retrieval | Natural Questions (test) | Recall@20: 86.1 | 25 |