Reward Modeling on RewardBench OOD Evaluation
[Chart: Chat score by date. Plotted point: FsfairX-Llama3-RM-v0.1, 99.4, May 17, 2025. Metrics: Chat, Chat Hard, Safety, Reasoning, Average.]
Evaluation Results
Method                  Configuration              Links    Chat  Chat Hard  Safety  Reasoning  Average
FsfairX-Llama3-RM-v0.1  Backbone=Llama3, Versi...  2025.05  99.4  65.1       87.8    86.4       84.7
Mutual-Taught           Iteration=1                2025.05  98.3  63.9       85.1    95.8       85.8
Mutual-Taught           Iteration=2                2025.05  98.2  66.3       87.8    95.7       87.0
GPT-4o-2024-08-06       Version=2024-08-06         2025.05  96.1  76.1       88.1    86.6       86.7
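The page does not say how the Average column is computed. As a rough check, the sketch below assumes it is the unweighted mean of the four category scores (Chat, Chat Hard, Safety, Reasoning) and compares that mean with the listed value, allowing for rounding to one decimal place. The names `rows`, `mean_score`, and `matches_listed` are illustrative and not part of the benchmark or of any tooling; the numbers are copied from the results above.

```python
# Scores copied from the Evaluation Results above:
# (Chat, Chat Hard, Safety, Reasoning) followed by the listed Average.
rows = {
    "FsfairX-Llama3-RM-v0.1": ((99.4, 65.1, 87.8, 86.4), 84.7),
    "Mutual-Taught (Iteration=1)": ((98.3, 63.9, 85.1, 95.8), 85.8),
    "Mutual-Taught (Iteration=2)": ((98.2, 66.3, 87.8, 95.7), 87.0),
    "GPT-4o-2024-08-06": ((96.1, 76.1, 88.1, 86.6), 86.7),
}


def mean_score(scores):
    """Unweighted mean of the category scores (an assumption, not a documented rule)."""
    return sum(scores) / len(scores)


def matches_listed(scores, listed, tol=0.05):
    """True if the mean agrees with the listed Average up to one-decimal rounding."""
    return abs(mean_score(scores) - listed) <= tol


# Rank methods by the listed Average, highest first.
ranking = sorted(rows, key=lambda name: rows[name][1], reverse=True)

for name in ranking:
    scores, listed = rows[name]
    print(f"{name}: mean={mean_score(scores):.3f} "
          f"listed={listed} consistent={matches_listed(scores, listed)}")
```

Under this assumption all four listed averages agree with the computed means, and the second Mutual-Taught iteration ranks first on Average even though FsfairX-Llama3-RM-v0.1 leads on Chat.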