| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Long-context Memory Evaluation | LongMemEval | Average Score 95.6 | 103 |
| Long-context Question Answering | LongMemEval LongConvQA | SH Score 90.3 | 84 |
| Long-term conversational memory | LongMemEval Small | LLM Accuracy (%) 66.33 | 32 |
| Memory-augmented language modeling evaluation | LONGMEMEVAL-S | Accuracy 73.2 | 31 |
| Long-term Memory Evaluation | LongMemEval S (test) | KU (Knowledge Update) 94.4 | 30 |
| Retrieval | LongMemEval | Recall@5 99 | 25 |
| Memory | LongMemEval | Accuracy 34.72 | 25 |
| Dialogue Memory Accuracy | LongMemEval-S (N=500) | Temporal Accuracy 91 | 24 |
| Long-term Memory Evaluation | LongMemEvalS | Overall Score 95.6 | 23 |
| Memory Question Answering | LongMemEval | Accuracy 76 | 22 |
| Long-context Memory Retrieval and Reasoning | LongMemEval 1M | F1 Score 49.58 | 20 |
| Long-context Memory Retrieval and Reasoning | LongMemEval 128K | F1 Score 47.26 | 20 |
| End-to-End Performance | LongMemEval | Top-5 Recall 59.9 | 20 |
| Runtime Agent Memory | LongMemEval | F1 Score 40.53 | 20 |
| Question Answering | LongMemEval S (test) | QA Score (TR Context) 84.21 | 19 |
| Long-term Memory Retrieval | LongMemEval-S | SSU 100 | 19 |
| Long-term dialogue memory | LongMemEval (test) | Accuracy 85.75 | 18 |
| Retrieval | LongMemEval-S | Recall@5 94.68 | 17 |
| Long-term Agent Memory Evaluation | LongMemEval | SS-U 95.7 | 15 |
| Long-term memory performance | LongMemEval S (test) | Accuracy 86.4 | 13 |
| Long-horizon conversation utility evaluation | LongMemEval | Accuracy 77.8 | 12 |
| Long-term Memory | LongMemEval | Score 90.8 | 12 |
| Long-term memory evaluation | LongMemEval S | Single-User Score 97.14 | 12 |
| Question Answering | LongMemEval 500 questions | QA Accuracy 61.4 | 12 |
| Fact recall | LongMemEval (500 questions) | Fact Recall 97 | 12 |