Anthropic Helpful Harmless prompts

Benchmarks

Task Name	Dataset Name	SOTA Result	Trend
RLHF Backdoor Attack	Anthropic Helpful Harmless prompts (train test)	UHR Rate: 28.1		30
