Language Model Alignment on Safe RLHF
[Chart: Win Rate (Helpfulness). Top entry: PbCRL, 80.7 (Mar 24, 2026). Available metrics: Win Rate (Helpfulness), Win Rate (Harmlessness), Reward, Cost.]
Updated 23d ago
Evaluation Results
Method     Base Model      Date     Win Rate (Helpfulness)  Win Rate (Harmlessness)  Reward  Cost
PbCRL      Llama-3.2-1...  2026.03  80.7                    82.1                     4.22    3.03
Safe RLHF  Llama-3.2-1...  2026.03  79.4                    76.5                     4.05    2.97
PPO        Llama-3.2-1...  2026.03  72.5                    60.7                     2.78    -0.57