| Dataset Name | SOTA Method | Metric | Score | Trend | Updated |
|---|---|---|---|---|---|
| ELI5 | InSemRAG | ROUGE-L | 31.15 | 57 | 23h ago |
| ELI5 (test) | RBG | ROUGE-L | 27.13 | 54 | 3mo ago |
| Retrieval-style queries | RAG | Precision | 87 | 29 | 1mo ago |
| Summary-style queries | HLTM | Token-F1 | 63.5 | 29 | 1mo ago |
| Novel GraphRAG-Bench | A-RAG (Full) | LLM-Acc | 85.3 | 20 | 3mo ago |
| GraphRAG-Bench Med | A-RAG (Full) | LLM Accuracy | 93.1 | 20 | 3mo ago |
| ASQA | CLARION | str-em | 91.18 | 19 | 1mo ago |
| ExpertQA | NWCAD | ROUGE-L | 23.34 | 18 | 1mo ago |
| Biography | EWE | VeriScore F1 | 49.7 | 14 | 3mo ago |
| AlpacaFact | EWE | VeriScore F1 | 66.9 | 14 | 3mo ago |
| Fava | EWE | VeriScore F1 | 61 | 14 | 3mo ago |
| LongFact | EWE | VeriScore F1 | 75.9 | 14 | 3mo ago |
| Long-form QA (test) | ALARM | Win Rate vs. Holistic Reward | 61.7 | 13 | 3mo ago |
| ELI5 (val) | | F1 | 31.5 | 11 | 3mo ago |
| ELI5 KILT (test) | RT + C-REALM | F1 | 25.4 | 8 | 3mo ago |
| FetaQA (test) | Chain-of-Table | BLEU | 32.61 | 7 | 1mo ago |
| ALCE LFQA | ATTR. FIRST_CoT | ROUGE-L | 38.6 | 7 | 3mo ago |
| MM-Telco Telecom Blog | | ROUGE-1 | 27 | 6 | 1mo ago |
| ELI5 standard original | Fourier-BART-FP | RL Score | 26.9 | 5 | 3mo ago |
| GroundBench (test) | RHIO-13B | Faithfulness (Full) | 87.5 | 4 | 3mo ago |
| LFQA | | AIS (Decomposition) | 90.9 | 4 | 3mo ago |
| KILT ELI5 (test) | NTP + NSP | Retrieval Score | 36.3 | 4 | 3mo ago |
| HQ2A | Error-Informed Refinement (EIR) | Comprehensiveness | 100 | 3 | 3mo ago |
| LFQA (test) | ATTR. FIRST | R-L | 38.2 | 3 | 3mo ago |
| KILT ELI5 (dev test) | KID | RL Score | 26.3 | 3 | 3mo ago |