FLASK

Benchmarks

Task Name                       Dataset Name    SOTA Result                                Trend
LLM-as-a-judge evaluation       FLASK           Pearson's r: 0.589                         36
Direct Assessment               Flask           Pearson Correlation Coefficient: 0.7203    12
Vulnerability Detection         FLASK           TP: 5                                      7
Feedback Evaluation Alignment   FLASK           Kendall's Tau: 0.405                       6
Feedback evaluation             FLASK (test)    Kendall's Tau: 0.385                       5