
Harmlessness Evaluation on HH-RLHF

Harmlessness Rate: 94.5

SFT + TTL + DPO + TPO

[Chart: Harmlessness Rate; axis ticks 90.028, 91.189, 92.35, 93.511; May 8, 2026]
Updated 23d ago

Evaluation Results

Date      Harmlessness Rate   Method   Links
2026.05   94.5
          94.1
2026.05   93.5
2026.05   92.8
2026.05   90.2
2026.05   90.2