
RM-Bench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Reward Modeling | RM-Bench | Accuracy: 96 | 125 |
| Reward Modeling | RM-Bench (test) | Overall Score: 96 | 63 |
| Reward Modeling Evaluation | RM-Bench | Chat Score: 75.6 | 55 |
| Reward Modeling | RM Bench Code | EF: 0.154 | 52 |
| Reward Modeling | RM-Bench Chat Hard | Accuracy: 83.3 | 34 |
| Reward Modeling | RM-Bench v1.0 (test) | Overall Score: 74.3 | 29 |
| Reward Modeling Suitability Evaluation | RM Bench Math | EF: -0.077 | 26 |
| Reward Modeling Suitability Evaluation | RM Bench Safety-accept | EF: 0.698 | 26 |
| Reward Model Suitability Audit | RM Bench Chat | EF: 0.313 | 26 |
| Reward Modeling | RM-Bench Chat | Accuracy: 78.5 | 18 |
| Reward Modeling | RM-Bench Chat subset Normal | Accuracy: 86 | 16 |
| Reward Modeling | RM-Bench (full) | Chat Score: 83 | 11 |
| Preference Prediction | RM-Bench | Accuracy: 87.8 | 10 |
| Reward Modeling | RM-Bench Hard | Accuracy: 0.697 | 10 |
| Reward Modeling | RM-Bench Normal | Accuracy: 80 | 10 |
| Reward Modeling | RM-Bench Easy | Accuracy: 92.2 | 10 |
| Reward Modeling | RM-Bench 1k | Positional Consistency: 73.5 | 8 |