| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Long-context Memory Evaluation | LongMemEval | Average Score | 95.6 | 52 |
| Memory-augmented language modeling evaluation | LONGMEMEVAL-S | Accuracy | 73.2 | 31 |
| Long-term Memory Evaluation | LongMemEval S (test) | KU (Knowledge Update) | 94.4 | 27 |
| Memory | LongMemEval | Accuracy | 34.72 | 25 |
| Long-context Memory Retrieval and Reasoning | LongMemEval 1M | F1 Score | 49.58 | 20 |
| Long-context Memory Retrieval and Reasoning | LongMemEval 128K | F1 Score | 47.26 | 20 |
| End-to-End Performance | LongMemEval | Top-5 Recall | 59.9 | 20 |
| Runtime Agent Memory | LongMemEval | F1 Score | 40.53 | 20 |
| Long-term Memory Retrieval | LongMemEval-S | SSU | 100 | 19 |
| Retrieval | LongMemEval | Recall@5 | 85.8 | 18 |
| Long-term dialogue memory | LongMemEval (test) | Accuracy | 85.75 | 18 |
| Retrieval | LongMemEval-S | Recall@5 | 94.68 | 17 |
| Dialogue Memory Accuracy | LongMemEval-S (N=500) | Temporal Accuracy | 91 | 17 |
| Long-term Memory Evaluation | LongMemEvalS | Overall Score | 95.6 | 16 |
| Long-term memory performance | LongMemEval S (test) | Accuracy | 86.4 | 13 |
| Long-context memory evaluation | LongMemEval-s | Overall Score | 75 | 12 |
| Question Answering | LongMemEval s | 4o-J Score | 60.2 | 11 |
| Retrieval | LongMemEval-M | Recall@5 | 77.4 | 10 |
| Long-term Memory Question Answering | LongMemEval-S (500 questions) | KU Accuracy | 98.7 | 9 |
| Question Answering | LongMemEval-m | 4o-J Score | 46.6 | 8 |
| Retrieval | LongMemEval (session-level) | Ra@5 | 80 | 8 |
| Retrieval | LongMemEval_M session-level granularity binary all-or-nothing recall | Recall@5 | 80 | 8 |
| Answer Generation | LongMemEval-s | 4o-J Score | 60.2 | 8 |
| Conversational Memory | LongMemEval 1.0 (test) | Overall Accuracy | 92.6 | 8 |
| Context Selection | LongMemEval session-level (test) | F1 Score | 0.649 | 8 |