
Bio

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering | Bio (test) | LLM-Judge Score: 82.9 | 105 |
| Question Answering under PIA attack | Bio | Accuracy: 75.2 | 60 |
| Long-form Generation | Bio | LLM-Judge Score: 81 | 59 |
| Factuality Correction | BIO (test) | Precision: 51 | 44 |
| Retrieval-Augmented Generation | Bio | Accuracy: 74.02 | 42 |
| Uncertainty Quantification | BIO | PCC: -0.129 | 32 |
| Dynamic Retrieval-Augmented Generation | Bio (test) | Accuracy: 83.1 | 24 |
| Factuality Correction | BIO dataset | Factual Precision: 93 | 24 |
| Conformal Prediction | bio (test) | Marginal Coverage: 90 | 19 |
| Question Answering | Bio | Few-Shot Accuracy: 84.3 | 17 |
| Long-form Biography Generation | Bio FactScore | FactScore: 81.2 | 17 |
| Question Answering | Bio poison @ Position 10, k=10 (test) | Robustness Score (LLM-J): 79.9 | 15 |
| Question Answering | Bio poison @ Position 1, k=10 (test) | Rob. LLM-J Score: 79.3 | 15 |
| Scientific Reasoning | bio | Pass Rate: 33.9 | 14 |
| Topic Modeling | Bio | IRBO: 100 | 13 |
| Topic Modeling | Bio | NPMI: 0.191 | 13 |
| Document Clustering | Bio (test) | NMI: 0.557 | 13 |
| Tabular Classification | BIO M (test) | Macro F1: 80.1 | 9 |
| Regression | bio | Coverage: 90.57 | 8 |
| Factuality Evaluation | BIO (test) | FS Score: 88.9 | 8 |
| AMR Parsing | BIO | Smatch: 62.8 | 8 |
| Factuality Evaluation | Bio | Precision: 14.1 | 6 |
| Long-form generation | Bio | PIA RLLMJ Score: 69.8 | 6 |
| Retrieval Question Answering | Bio | MRR: 0.15 | 6 |
| Conjunctive Query Answering | Bio queries (test) | AUC: 91 | 6 |
Showing 25 of 30 rows