
UltraFeedback

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| RLHF Alignment | UltraFeedback In-domain v1 (test) | Win Rate: 81 | 46 |
| MT-Bench | UltraFeedback | MT-Bench Score: 8.1 | 42 |
| AlpacaEval 2.0 | UltraFeedback | LC: 30 | 42 |
| Controllable Generation | Code-UltraFeedback | Diversity: 90.7 | 36 |
| Generative Performance | Ultrafeedback 61.1k (test) | Win Rate: 69.8 | 30 |
| Discriminative Performance | Ultrafeedback 61.1k (test) | Accuracy: 73.05 | 30 |
| Reward Modeling | UltraFeedback (test) | MAE: 0.1679 | 21 |
| LLM Alignment | UltraFeedback (test) | AlpacaEval 2 Win Rate (WR): 21 | 18 |
| Preference Alignment | Ultrafeedback 40% flipping ratio | Accuracy: 78.87 | 12 |
| Preference Alignment | Ultrafeedback 20% flipping ratio | Accuracy: 78.8 | 12 |
| Preference Alignment | UltraFeedback (test) | Accuracy: 74.18 | 11 |
| Direct Preference Optimization | UltraFeedback | Accuracy: 69.92 | 11 |
| Alignment | UltraFeedback (test) | IF Score: 68.5 | 11 |
| Correctness Assessment | UltraFeedback Property Constraints Satisfaction (test) | Worst-case Size Distortion (Helpfulness & Instruction-following): 0.07 | 9 |
| Preference Optimization Evaluation | UltraFeedback (test_prefs) | Pair Accuracy: 57.65 | 8 |
| Reward Modeling | UltraFeedback Cleaned | Total Score: 92.36 | 8 |
| LLM Alignment | UltraFeedback (in-domain) | Win Rate (KL, alpha=1): 80.6 | 8 |
| Preference Prediction | UltraFeedback 500 held-out users (test) | Test Accuracy: 70.53 | 7 |
| scoring | UltraFeedback | MAE: 0.623 | 5 |
| Human Evaluation | UltraFeedback 50 sampled questions | Win Rate (Expert 1): 62 | 5 |
| Instruction Following | ultrafeedback-prompt (test) | Win Rate: 52 | 4 |
| Instruction-following | UltraFeedback | Win Rate: 80.8 | 4 |
| LLM Alignment | UltraFeedback 2023 (test) | Win-rate: 55 | 4 |
| Format Debiasing | UltraFeedback Format-Biased (test) | Win-Rate (Bold): 89 | 4 |
| Model Ranking Prediction | UltraFeedback 13B+ Models Holdout (test) | Pairwise Accuracy (RM1_Honest): 74.8 | 4 |
Showing 25 of 32 rows