Long-context Conversational Question Answering on LoCoMo Multi-hop
[Chart: G-EVAL score over time. Best result: EviMem, 2.89 (Apr 30, 2026). Other selectable metrics: Judge Accuracy, F1 Score, ROUGE-L, BERTScore.]
Evaluation Results

Method       Setting                     Date     G-EVAL  Judge Accuracy  F1 Score  ROUGE-L  BERTScore
EviMem       System=EviMem (Full)        2026.04  2.89    85.2            26        23.9     85.7
Single-pass  System=Single-pass (La...   2026.04  2.67    81.4            15.1      13.8     83.8
MIRIX        System=MIRIX                2026.04  2.46    65.9            9.9       9.1      83.2
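The leaderboard does not say how its F1 Score and ROUGE-L columns are computed. As a rough illustration only, here is a minimal sketch of the common token-level definitions: F1 over the multiset of overlapping tokens, and ROUGE-L as an F-measure over the longest common subsequence (LCS). The tokenization (lowercase, whitespace split) and the function names `tokenize`, `token_f1`, `lcs_length`, and `rouge_l` are hypothetical choices for this sketch. The benchmark's own scorer may normalize text differently, and the table's values look like they are scaled to 0-100, though the page does not state this.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Hypothetical normalization: lowercase + whitespace split only.
    # A real scorer may also strip punctuation, articles, or apply stemming.
    return text.lower().split()


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer (0.0 to 1.0)."""
    pred, ref = tokenize(prediction), tokenize(reference)
    common = Counter(pred) & Counter(ref)  # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(prediction: str, reference: str) -> float:
    """ROUGE-L F-measure (precision and recall weighted equally), from LCS length."""
    pred, ref = tokenize(prediction), tokenize(reference)
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, scoring the prediction "she moved to Paris in 2019" against the reference "she moved to Paris" gives 0.8 under both functions: all 4 reference tokens are matched (recall 1.0) out of 6 predicted tokens (precision 2/3). The two metrics diverge when word order differs: "b a" versus "a b" has full token overlap (F1 of 1.0) but an LCS of only 1 (ROUGE-L of 0.5).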