
Anthropic Helpful Harmless prompts

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
RLHF Backdoor Attack | Anthropic Helpful Harmless prompts (train test) | UHR Rate: 28.1 |