ROAST: Rollout-based On-distribution Activation Steering Technique

About

Activation steering provides parameter-efficient control over large language models (LLMs) at inference time, but many methods rely on off-distribution supervision and discrete masking, leading to brittle interventions. We propose ROAST (Rollout-based On-distribution Activation Steering Technique), which estimates steering directions from the model's own on-distribution rollouts via ROC and avoids hard sparsification via Continuous Soft Scaling (CSS) and Grouped Mean Normalization. Our empirical analysis reveals that while activation magnitude correlates moderately with directional consistency, the variance in magnitude is significant and often disproportionate to semantic quality. This suggests that high-magnitude activations risk dominating the global steering direction if not properly normalized. To address this, ROAST employs grouped normalization to balance contributions across samples, ensuring a more robust estimation of the consensus steering direction. Across models (0.6B to 32B), ROAST consistently improves performance on diverse tasks (e.g., +9.7% on GSM8K for Qwen3-0.6B and +12.1% on TruthfulQA for GLM4-32B), and analyses show that CSS better preserves activation energy.
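The abstract's concern that high-magnitude activations can dominate the global steering direction can be illustrated with a small, hedged sketch. It is not the paper's Grouped Mean Normalization, whose grouping scheme is not described here. As a stand-in it rescales each sample's difference vector to unit length before averaging. The `diffs` data, the `consensus` axis, and all function names are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def mean(vectors):
    """Element-wise mean of equal-length vectors."""
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    ua, ub = unit(a), unit(b)
    return sum(x * y for x, y in zip(ua, ub))

# Hypothetical per-sample activation differences (made-up toy data).
# Three samples agree on the first axis; one outlier is about 50x larger
# and points along the second axis.
diffs = [[1.0, 0.1], [0.9, -0.1], [1.1, 0.0], [0.0, 50.0]]

# Plain mean: the outlier's magnitude dominates the estimated direction.
raw_direction = unit(mean(diffs))

# Assumed stand-in for normalization: each sample contributes a unit vector,
# so every sample has equal weight in the consensus direction.
normalized_direction = unit(mean([unit(d) for d in diffs]))

# Hypothetical "true" consensus axis, used only to compare the two estimates.
consensus = [1.0, 0.0]
raw_cos = cosine(raw_direction, consensus)
norm_cos = cosine(normalized_direction, consensus)

print(f"plain mean cosine to consensus:      {raw_cos:.3f}")
print(f"normalized mean cosine to consensus: {norm_cos:.3f}")
```

In this toy setup the plain mean is pulled almost entirely toward the single large outlier, while the normalized mean stays close to the direction shared by the majority of samples.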

Xuanbo Su, Hao Luo, Yingfang Zhang, Lijun Zhang • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K (test)	Accuracy 92.95	954
Mathematical Reasoning	MATH500 (test)	Accuracy 57.79	922
Instruction Following	IFEval	--	854
Natural Language Inference	XNLI	Accuracy 48.84	131
Commonsense Reasoning	WinoGrande	Accuracy 52.56	94
Instruction Following	IFEval (test)	--	92
Question Answering	TruthfulQA	Accuracy 86.64	73
Sentiment Analysis	SST2	Accuracy 89.9	17
Sentiment Analysis	SST5	Accuracy (%) 48.17	6
Multi-task Language Understanding	MMLU	Accuracy 40.73	4

Showing 10 of 11 rows
