LongBench

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Long-context Language Understanding | LongBench | M-Avg 60.31 | 292
Long-context Language Modeling | LongBench | Average Score 58.4 | 164
Long-context language understanding | LongBench (test) | Average Score 51.87 | 147
Long-context understanding | LongBench (test) | Avg Score 58.7 | 136
Long-context understanding | LongBench | Overall Average Score 62.1 | 115
Long Context Understanding | LongBench V2 | Overall Score 65.6 | 109
Long-context Language Understanding | LongBench | Average Score 58.4 | 86
Long-context understanding | LongBench | HotpotQA 57.15 | 82
Single-Doc Question Answering | LongBench | MultifieldQA Score 53.67 | 75
Long-context Question Answering | LongBench (test) | HotpotQA 7,011 | 69
Long-context Reasoning | LongBench | Score 73.8 | 62
Long-context language understanding | LongBench 1.0 (test) | MultiNews 61.5 | 61
Long-context Understanding | LongBench | Accuracy 103 | 60
Long-context Evaluation | LongBench | Average Score 31.96 | 57
Long document retrieval | LongBench Retrieval v2 (full) | F1 Score 0.4843 | 55
Few-shot Learning | LongBench | TREC Score 82.62 | 51
Summarization | LongBench | GovRep Score 33.39 | 51
Long-context Reasoning | LongBench v2 | Average Score 68.2 | 48
Long-context language understanding | LongBench v2 | Overall Accuracy 46.32 | 47
Long-context Reasoning | LongBench | Accuracy (LongBench) 68.7 | 45
Multi-Document Question Answering | LongBench | HotpotQA Acc 57.67 | 45
Code Analysis | LongBench | Lcc Score 70.64 | 43
Synthetic Tasks | LongBench | PCount 14.67 | 43
Multi-choice Question Answering | LongBench v2 | Overall Accuracy 46.5 | 41
Long-context language modeling | LongBench-E 1.0 (test) | S-Doc QA Perf. 49.92 | 37

Showing 25 of 132 rows