| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Long-context Question Answering | LoCoMo | F1 (Multi Hop) | 50.25 | 171 |
| Long-term memory evaluation | LoCoMo | Overall F1 | 93 | 128 |
| Multi-hop Question Answering | LoCoMo | F1 | 48.35 | 125 |
| Single-hop Question Answering | LoCoMo | F1 | 0.6408 | 111 |
| Open-domain Question Answering | LoCoMo | F1 | 0.4013 | 111 |
| Temporal Question Answering | LoCoMo | F1 | 0.6634 | 85 |
| Long-context Memory Retrieval | LoCoMo | Single-hop | 97.1 | 80 |
| Long-context Reasoning | LoCoMo | Average F1 | 95.06 | 75 |
| Multi-Hop Reasoning | LoCoMo | F1 Score | 41.62 | 68 |
| Long-term conversational memory | LoCoMo | Overall Acc (LoCoMo) | 79.44 | 59 |
| Long-context Conversational Question Answering | LoCoMo | Multi-Hop F1 | 43.1 | 59 |
| Long-context Management | LoCoMo | F1 Score | 65.2 | 57 |
| Overall Reasoning (Average) | LoCoMo | F1 Score (LoCoMo) | 44.68 | 52 |
| Open-Domain | LoCoMo | F1 Score | 48.38 | 51 |
| Single-Hop | LoCoMo | F1 Score | 59.03 | 47 |
| Temporal | LoCoMo | F1 Score | 0.4935 | 47 |
| Question Answering | LoCoMo | Single Hop F1 | 67.13 | 45 |
| Long-context Question Answering | LoCoMo | Single-Hop LLJ Score | 97.1 | 45 |
| Temporal Reasoning | LoCoMo | F1 Score | 65.06 | 45 |
| Long-context reasoning and retrieval | LoCoMo (test) | Single-Hop F1 | 95.12 | 37 |
| Conversation Question Answering | LOCOMO (test) | RAG F1 | 36.75 | 36 |
| Long-form Dialogue | LoCoMo | EM | 37.24 | 32 |
| Single-Hop Reasoning | LoCoMo | F1 Score | 57.55 | 28 |
| Long-term Question Answering | LoCoMo | Multi-Hop F1 | 44.24 | 27 |
| Long-term dialogue memory | LoCoMo (test) | Accuracy | 84.23 | 27 |