
Human Evaluation set

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Rebuttal Generation | Human Evaluation Set (100 comments) 1.0 (test) | Attitude Score: 9.92 | 8 |
| Query Auto-Completion | Human Evaluation Set | Item-wise Score: 69.9 | 4 |
| Style Customization | Human evaluation set Generated texts (test) | Content Score: 79 | 4 |
| Simultaneous Speech-to-Speech Translation | Human Evaluation Set French short-form | Audio Quality MOS: 64.5 | 3 |
| Video Generation | Human evaluation set 15 videos (test) | Image Prompt Alignment: 4.4 | 3 |
| Single-Attribute Controlled Text Generation | Human Evaluation Set | Quality Score: 4.2 | 3 |
| Machine Translation | Human Evaluation set en-pt 1.0 (test) | Gender Agreement: 2.78 | 3 |
| Machine Translation | Human Evaluation set en-pl 1.0 (test) | Gender Agreement: 2.64 | 3 |
| Machine Translation | Human Evaluation set en-ja 1.0 (test) | Gender Agreement: 2.97 | 3 |
| Machine Translation | Human Evaluation set en-hi 1.0 (test) | Gender Agreement: 2.89 | 3 |
| Machine Translation | Human Evaluation set en-fr 1.0 (test) | Gender Agreement: 2.96 | 3 |
| Machine Translation | Human Evaluation set en-ar 1.0 (test) | Gender Agreement: 2.79 | 3 |
| Simultaneous Speech-to-Speech Translation | Human Evaluation Set German short-form | Audio Quality: 73.5 | 2 |
| Simultaneous Speech-to-Speech Translation | Human Evaluation Set Portuguese short-form | Audio Quality: 62 | 2 |
| Simultaneous Speech-to-Speech Translation | Human Evaluation Set Spanish short-form | Audio Quality (MOS): 66.8 | 2 |