LongMemEval

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
| --- | --- | --- | --- | --- |
| Long-context Memory Evaluation | LongMemEval | Average Score | 95.6 | 52 |
| Memory-augmented language modeling evaluation | LONGMEMEVAL-S | Accuracy | 73.2 | 31 |
| Long-term Memory Evaluation | LongMemEval S (test) | KU (Knowledge Update) | 94.4 | 27 |
| Memory | LongMemEval | Accuracy | 34.72 | 25 |
| Long-context Memory Retrieval and Reasoning | LongMemEval 1M | F1 Score | 49.58 | 20 |
| Long-context Memory Retrieval and Reasoning | LongMemEval 128K | F1 Score | 47.26 | 20 |
| End-to-End Performance | LongMemEval | Top-5 Recall | 59.9 | 20 |
| Runtime Agent Memory | LongMemEval | F1 Score | 40.53 | 20 |
| Long-term Memory Retrieval | LongMemEval-S | SSU | 100 | 19 |
| Retrieval | LongMemEval | Recall@5 | 85.8 | 18 |
| Long-term dialogue memory | LongMemEval (test) | Accuracy | 85.75 | 18 |
| Retrieval | LongMemEval-S | Recall@5 | 94.68 | 17 |
| Dialogue Memory Accuracy | LongMemEval-S (N=500) | Temporal Accuracy | 91 | 17 |
| Long-term Memory Evaluation | LongMemEvalS | Overall Score | 95.6 | 16 |
| Long-term memory performance | LongMemEval S (test) | Accuracy | 86.4 | 13 |
| Long-context memory evaluation | LongMemEval-s | Overall Score | 75 | 12 |
| Question Answering | LongMemEval s | 4o-J Score | 60.2 | 11 |
| Retrieval | LongMemEval-M | Recall@5 | 77.4 | 10 |
| Long-term Memory Question Answering | LongMemEval-S (500 questions) | KU Accuracy | 98.7 | 9 |
| Question Answering | LongMemEval-m | 4o-J Score | 46.6 | 8 |
| Retrieval | LongMemEval (session-level) | Ra@5 | 80 | 8 |
| Retrieval | LongMemEval_M session-level granularity binary all-or-nothing recall | Recall@5 | 80 | 8 |
| Answer Generation | LongMemEval-s | 4o-J Score | 60.2 | 8 |
| Conversational Memory | LongMemEval 1.0 (test) | Overall Accuracy | 92.6 | 8 |
| Context Selection | LongMemEval session-level (test) | F1 Score | 0.649 | 8 |
Showing 25 of 55 rows