| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Long Form Question Answering | ELI5 (test) | ROUGE-L | 27.13 | 54 |
| Watermarking | eli5-category (test) | PPL | 1.313 | 28 |
| Long Form Question Answering | ELI5 | ROUGE-L | 25.57 | 27 |
| Attributed Text Generation | ELI5 | Claim Correctness Score | 25.9 | 19 |
| Text generation quality and watermark detectability | ELI5 | AUC | 100 | 16 |
| Word-level length-controlled question answering | ELI5 (test) | MAE | 14.1 | 14 |
| Question Answering | ELI5 Wiki-answerable | ROUGE-L Score | 26.6 | 14 |
| Sentence-level length-controlled question answering | ELI5 (test) | MAE | 0.08 | 12 |
| Token-level length-controlled question answering | ELI5 (test) | MAE | 10.33 | 12 |
| Long-Form Question Answering | ELI5 (val) | F1 | 31.5 | 11 |
| Sentence-level attribution | ELI5 (test) | Citation Recall | 81.9 | 10 |
| Grounded Generation | ELI5 (test) | Fluency (MAUVE) | 47.43 | 10 |
| Knowledge-intensive generation | ELI5 (dev) | ROUGE-L Score | 26.6 | 9 |
| Long-form Question Answering | ELI5 KILT (test) | F1 | 25.4 | 8 |
| Long-form Question Answering with Citations | ELI5 | Correctness | 0.186 | 8 |
| Retrieval | ELI5 KILT (test) | Retrieval Precision | 11 | 8 |
| Open-domain QA | ELI5 | ROUGE-L | 20.75 | 8 |
| Fine-grained passage-level retrieval | ELI5 | Answer in Context | 16.85 | 7 |
| Instruction-following | ELI5 prompts Gemma-7B-it 200 tokens (test) | Perplexity | 1.7784 | 6 |
| Question Answering | ELI5 | BLEU-1 Score | 27.9 | 6 |
| Attributed Question Answering | ELI5 (test) | ROUGE-L | 21.3 | 5 |
| Long-form Question Answering refinement | ELI5 (test) | Error Rate | 0.0381 | 5 |
| Long-form Question Answering | ELI5 standard original | ROUGE-L Score | 26.9 | 5 |
| Natural Language Generation | ELI5 | Relevance (Mean) | 5.14 | 5 |
| Knowledge-grounded Generation | ELI5 ALCE (test) | Correctness | 10.5 | 4 |