
FLASK

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| LLM-as-a-judge evaluation | FLASK | Pearson's r: 0.518 | 16 |
| Vulnerability Detection | FLASK | TP: 5 | 7 |
| Feedback Evaluation Alignment | FLASK | Kendall's Tau: 0.405 | 6 |
| Feedback evaluation | FLASK (test) | Kendall's Tau: 0.385 | 5 |