| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Long-context language model evaluation | HELMET | Average Score: 55.2 | 39 |
| Long-context language modeling evaluation | HELMET | Average Sparsity: 0 | 28 |
| Long-context language modeling | HELMET | Summarization Score: 247 | 27 |
| Long-context understanding | HELMET 2025 | Accuracy (8K Context): 61.44 | 16 |
| Long-context understanding | HELMET | Accuracy: 68.5 | 15 |
| Long-context language modeling evaluation | HELMET (held-out eval) | Accuracy (8K Context): 57.61 | 13 |
| Question answering | HELMET (RAG subset) | HotpotQA Accuracy: 81.1 | 8 |
| Long-context multimodal understanding | HELMET | Accuracy: 67.6 | 6 |
| Long-context understanding | HELMET (context lengths ≤32K) | Score (8K Context): 59.34 | 4 |
| Holistic long-context understanding | HELMET (Holistic Understanding) | HELMET Holistic Understanding (64K Context): 46.51 | 4 |