Refusal Induction on Refusal Induction held-in and held-out

55 Activation Patching

Difference-in-means

Updated 4d ago

Evaluation Results

Method	Links
Difference-in-means 2026.02		55	55	41	58	28
Mean steering 2026.02		48	45	8	18	20
Mean steering 2026.02		48	49	22	34	30
Difference-in-means 2026.02		46	41	27	20	25
ReFT 2026.02		40	34	6	23	5
ReFT 2026.02		24	24	14	31	27
Difference-in-means 2026.02		24	23	20	14	13
Mean steering 2026.02		12	10	12	6	6
ReFT 2026.02		10	8	10	5	5