RLHF Backdoor Attack on Anthropic Helpful Harmless prompts (train test)

28.1 UHR Rate

Random

[Chart: UHR Rate, Oct 10, 2025]
Updated 1d ago

Evaluation Results

Date      UHR Rate   (unlabeled)   (unlabeled)
2025.10   28.1       82.6          63.4
2025.10   28.1       86.1          85.6
2025.10   28         87.4          51.2
2025.10   27.8       79.8          53.3
2025.10   27.5       83.2          66.1
2025.10   26.8       68            48
2025.10   25.3       74.9          58.5
2025.10   25         72            56.8
2025.10   24.3       67.3          49.1
2025.10   23.8       70.5          50.3
2025.10   23.7       69.3          71.7
2025.10   23.6       66.6          62.4
2025.10   23.5       75.5          76
2025.10   23.5       71            47.8
2025.10   23.4       68.9          60.7
2025.10   22.7       67.5          38.8
2025.10   22.1       69            50
2025.10   21.5       70            44
2025.10   21.3       68            61.7
2025.10   20.5       60.5          55.3
2025.10   17.3       77.2          70.5
2025.10   17         81.2          32
2025.10   16.3       84.5          40.3
2025.10   15.1       35            41
2025.10   14.5       42            40.5
2025.10   14.5       43            42.2
2025.10   14         81.4          32
2025.10   13.3       80            77.5
2025.10   13.1       38.4          24
2025.10   12         45            20