
TL;DR

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Summarization | TL;DR | Winrate: 91.8 | 59 |
| Summarization | TL;DR (test) | Win Rate: 82.5 | 49 |
| Preference Alignment | TL;DR (test) | Win Rate: 68.8 | 36 |
| LLM Judgement Confidence Estimation | TL;DR (test) | RK: 0.4269 | 16 |
| Summarization | TL;DR (distillation set) | Word Count: 27.24 | 16 |
| Text Summarization | TL;DR | AlignScore: 94.2 | 15 |
| Reward Modeling | TL;DR Seen (n=100) | Accuracy: 62.3 | 14 |
| LLM-as-a-judge | TL;DR | Coverage: 82.6 | 12 |
| Confidence Estimation | TL;DR | Rank Correlation (RK): 0.421 | 11 |
| Summarization | TL;DR | Completeness: 43 | 11 |
| Reward Modeling | TL;DR Overall n=150 | Accuracy: 62.9 | 7 |
| Reward Modeling | TL;DR Unseen (n=150) | Accuracy: 62.4 | 7 |
| Reward Modeling | TL;DR n=150 Seen | Accuracy: 63.3 | 7 |
| Reward Modeling | TL;DR n=100 Unseen | Accuracy: 61.5 | 7 |
| Summarization | TL;DR | Win Rate: 92.8 | 6 |
| Summarization (Groundedness) | TL;DR | Kendall's Tau: 0.46 | 5 |
| Summarization (Completeness) | TL;DR | Kendall's Tau: 0.44 | 5 |
| LLM Alignment | TL;DR (test) | Win Rate (GPT-4o): 68.56 | 4 |
| Preference Alignment | TL;DR | GRA (%): 64.4 | 4 |
| Summarization | TL;DR | Winrate: 50.5 | 4 |
| Summarization Preference Evaluation | TL;DR (val) | Metric: - | 0 |
| Text Summarization | TL;DR (test) | Metric: - | 0 |