| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Long-context Reasoning | BABILong 16k | Accuracy | 29.8 | 72 |
| Long-context Reasoning | BABILong 8k | Accuracy | 34.7 | 65 |
| Long-context Reasoning | BABILong 4k | Accuracy (BABILong 4k) | 38.5 | 51 |
| Question Answering | Babilong 16k context length | QA1 Accuracy | 58 | 9 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 8K | Accuracy | 57 | 6 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 4K | Accuracy | 61.2 | 6 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 2K | Accuracy | 65.2 | 6 |
| Long-context reasoning | BABILong | Err (2k Context) | 14.1 | 6 |
| Question Answering | Babilong 128k context length | QA1 Score | 38 | 5 |
| Question Answering | Babilong 64k context length | QA1 Score | 25 | 5 |
| Needle-in-a-Haystack Retrieval | BABILong 32K context length | Accuracy | 9 | 3 |
| Needle-in-a-Haystack Retrieval | BABILong 16K context length | Needle-in-a-Haystack Accuracy (16K) | 22.2 | 3 |