UltraFeedback

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| RLHF Alignment | UltraFeedback In-domain v1 (test) | Win Rate | 81 | 46 |
| MT-Bench | UltraFeedback | MT-Bench Score | 8.1 | 42 |
| AlpacaEval 2.0 | UltraFeedback | LC | 30 | 42 |
| Controllable Generation | Code-UltraFeedback | Diversity | 90.7 | 36 |
| Generative Performance | Ultrafeedback 61.1k (test) | Win Rate | 69.8 | 30 |
| Discriminative Performance | Ultrafeedback 61.1k (test) | Accuracy | 73.05 | 30 |
| Preference Alignment | Ultrafeedback 40% flipping ratio | Accuracy | 78.87 | 12 |
| Preference Alignment | Ultrafeedback 20% flipping ratio | Accuracy | 78.8 | 12 |
| Alignment | UltraFeedback (test) | IF Score | 68.5 | 11 |
| LLM Alignment | UltraFeedback (in-domain) | Win Rate (KL, alpha=1) | 80.6 | 8 |
| Preference Prediction | UltraFeedback 500 held-out users (test) | Test Accuracy | 70.53 | 7 |
| Human Evaluation | UltraFeedback 50 sampled questions | Win Rate (Expert 1) | 62 | 5 |
| LLM Alignment | UltraFeedback 2023 (test) | Win-rate | 55 | 4 |
| Format Debiasing | UltraFeedback Format-Biased (test) | Win-Rate (Bold) | 89 | 4 |
| Model Ranking Prediction | UltraFeedback 13B+ Models Holdout (test) | Pairwise Accuracy (RM1_Honest) | 74.8 | 4 |
| Model Ranking Prediction | UltraFeedback 30B+ Models Holdout (test) | Pairwise Acc (RM1_Honest) | 77.3 | 4 |
| Model Ranking Prediction | UltraFeedback 70B+ Models Holdout (test) | Pairwise Acc (RM1_Honest) | 77.4 | 4 |
| Pairwise Preference Ranking | UltraFeedback 10% holdout (test) | Pairwise Accuracy (RM1-Honest) | 86.3 | 4 |
| Pairwise Preference Ranking | UltraFeedback 5% holdout (test) | Pairwise Accuracy (RM1-Honest) | 87 | 4 |
| Pairwise Preference Ranking | UltraFeedback 2% holdout (test) | Pairwise Acc (RM1-Honest) | 89.1 | 4 |