LongMemEval

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Long-context Memory Evaluation | LongMemEval | Average Score: 95.6 | 103 |
| Long-context Question Answering | LongMemEval LongConvQA | SH Score: 90.3 | 84 |
| Long-term conversational memory | LongMemEval Small | LLM Accuracy (%): 66.33 | 32 |
| Memory-augmented language modeling evaluation | LONGMEMEVAL-S | Accuracy: 73.2 | 31 |
| Long-term Memory Evaluation | LongMemEval S (test) | KU (Knowledge Update): 94.4 | 30 |
| Retrieval | LongMemEval | Recall@5: 99 | 25 |
| Memory | LongMemEval | Accuracy: 34.72 | 25 |
| Dialogue Memory Accuracy | LongMemEval-S (N=500) | Temporal Accuracy: 91 | 24 |
| Long-term Memory Evaluation | LongMemEvalS | Overall Score: 95.6 | 23 |
| Memory Question Answering | LongMemEval | Accuracy: 76 | 22 |
| Long-context Memory Retrieval and Reasoning | LongMemEval 1M | F1 Score: 49.58 | 20 |
| Long-context Memory Retrieval and Reasoning | LongMemEval 128K | F1 Score: 47.26 | 20 |
| End-to-End Performance | LongMemEval | Top-5 Recall: 59.9 | 20 |
| Runtime Agent Memory | LongMemEval | F1 Score: 40.53 | 20 |
| Question Answering | LongMemEval S (test) | QA Score (TR Context): 84.21 | 19 |
| Long-term Memory Retrieval | LongMemEval-S | SSU: 100 | 19 |
| Long-term dialogue memory | LongMemEval (test) | Accuracy: 85.75 | 18 |
| Retrieval | LongMemEval-S | Recall@5: 94.68 | 17 |
| Long-term Agent Memory Evaluation | LongMemEval | SS-U: 95.7 | 15 |
| Long-term memory performance | LongMemEval S (test) | Accuracy: 86.4 | 13 |
| Long-horizon conversation utility evaluation | LongMemEval | Accuracy: 77.8 | 12 |
| Long-term Memory | LongMemEval | Score: 90.8 | 12 |
| Long-term memory evaluation | LongMemEval S | Single-User Score: 97.14 | 12 |
| Question Answering | LongMemEval 500 questions | QA Accuracy: 61.4 | 12 |
| Fact recall | LongMemEval (500 questions) | Fact Recall: 97 | 12 |
Showing 25 of 90 rows