| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Long-context language modeling evaluation | HELMET | Average Sparsity | 0 | 28 |
| Long-context language modeling | HELMET | Summarization Score | 247 | 27 |
| Long-context Understanding | HELMET 2025 | Accuracy (8K Context) | 61.44 | 16 |
| Long-context language modeling evaluation | HELMET held-out eval | Accuracy (8K Context) | 57.61 | 13 |
| Question Answering | HELMET RAG subset | HotpotQA Accuracy | 81.1 | 8 |
| Long-context Multimodal Understanding | HELMET | Accuracy | 67.6 | 6 |
| Long-context Understanding | HELMET | Accuracy | 64.7 | 4 |