Finetuning with implicit harmful data on Identity-shift

Utility: 53

Safe Lora

Updated 2mo ago

Evaluation Results

Method	Links	Utility
Safe Lora 2026.05		53	62	3.27
OpenAI Moderation 2026.05		52	29	1.92
Llamaguard 2026.05		52	53	2.98
No defense 2026.05		51	75	3.75
SafeInstr 2026.05		51	8	1.24
Backdoor 2026.05		51	11	1.37
GradShield 2026.05		51	1	1.01
Base 2026.05		34	0	1