Manual evaluation set

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Helpfulness Evaluation | Manual Evaluation Set | Average Helpfulness Score: 4.57 | 24 |
| Safety Evaluation | Manual Evaluation Set | Average Safety Score: 3.83 | 12 |
| Actionable Suggestion Extraction | Manual evaluation set 1.0 (test) | BERTScore: 92 | 4 |