
LV-Eval

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Question Answering | LV-Eval (test) | EM | 14.5 | 19 |
| Multi-hop Question Answering | LV-Eval (test) | F1 Score | 12.9 | 14 |
| Long-context Question Answering | LV-Eval | F1 Score | 14.81 | 14 |
| Long-context understanding | LV-Eval 128k | SubEM | 17.5 | 9 |
| Long-context understanding | LV-Eval 64k | SubEM | 28.33 | 9 |
| Long-context understanding | LV-Eval 32k | SubEM | 39.17 | 9 |
| Long-context understanding | LV-Eval 16k | SubEM | 40 | 9 |
| Question Answering | LV-Eval | Average Token Count | 51,066.2 | 7 |
| Multi-hop Question Answering | LV-Eval | Average Running Time (s) | 1.31 | 6 |
| Retrieval | LV-Eval | Average Running Time (s) | 0.41 | 5 |
| Long-context retrieval and reasoning | LV-Eval | Performance (16k Context) | 58.82 | 5 |
| Long-context language understanding | LV-Eval | CMRC (Mixup) | 7.05 | 4 |
| Multi-Hop QA | LV-Eval | EM | 10.5 | 3 |
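The EM, SubEM, and F1 metrics in the table are standard question-answering scores. As a rough sketch of how they are typically computed (SQuAD-style normalization with lowercasing, punctuation and article removal; LV-Eval's exact normalization and any keyword-based adjustments may differ):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """EM: 1.0 iff the normalized strings are identical."""
    return float(normalize(prediction) == normalize(reference))

def substring_exact_match(prediction: str, reference: str) -> float:
    """SubEM: 1.0 iff the normalized reference appears inside the prediction."""
    return float(normalize(reference) in normalize(prediction))

def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of token precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `exact_match("The Eiffel Tower", "eiffel tower")` scores 1.0 after normalization, while `substring_exact_match` also credits predictions that merely contain the answer, which is why SubEM is the looser of the two.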