Language Model Alignment on Safe RLHF

80.7 Win Rate (Helpfulness)

PbCRL

Updated 4mo ago

Evaluation Results

Method	Date	Links	Win Rate (Helpfulness)			
PbCRL	2026.03		80.7	82.1	4.22	3.03
Safe RLHF	2026.03		79.4	76.5	4.05	2.97
PPO	2026.03		72.5	60.7	2.78	-0.57