We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong

About

Alignment of Large Language Models (LLMs) is the ability to satisfy desired objectives during generation, which is critical for trustworthy deployment. In practice, alignment is often operationalized through multiple objectives such as Helpfulness, Harmlessness, and Honesty (HHH). Prior work studies alignment via steering vectors in standard Transformer decoders but treats objectives in isolation, so optimizing a single objective can overwrite others, leading to interference. Recent work attempts to address this limitation by extending steering to a 1-to-N Transformer setting, replicating representations into objective-specific pathways, but applies transformations independently, resulting in inconsistent responses across objectives. Similarly, approaches such as safe RLHF and MoE-based designs study trade-offs across objectives but do not constrain objective-specific transformations within a shared representation during inference. As a result, even aligned State-of-the-Art (SOTA) LLMs can struggle to jointly satisfy HHH objectives in complex settings. To address this, we propose Adaptive Multi-Branch Steering (AMBS), a two-stage framework in a 1-to-N Transformer setting that parameterizes objective-specific transformations relative to a shared representation. In Stage I, a shared hidden representation is computed once. In Stage II, this representation is replicated into N pathways and updated relative to a shared reference, capturing objective-specific deviations while restricting divergence. This produces N objective-specific responses within a single forward pass, which can be combined at decoding to obtain a single response across objectives. Across multiple backbones, AMBS improves performance across HHH, with consistent gains in Win Rate (WR), TI, and SS (e.g., Avg 56.5% on LLaMA-2-7B) while maintaining efficiency (e.g., 189 Tok/s, 9 GPU-hrs).
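
The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the actual parameterization, so the linear maps, the tanh nonlinearity, the norm clipping used to "restrict divergence", and the weighted average used to combine branches are all assumptions made for illustration. Every function and variable name below is hypothetical.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def shared_representation(x, W_shared):
    """Stage I (assumed form): compute one shared hidden vector, once."""
    return [math.tanh(v) for v in matvec(W_shared, x)]

def clip_norm(v, max_norm):
    """Rescale v so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm <= max_norm or norm == 0.0:
        return list(v)
    scale = max_norm / norm
    return [c * scale for c in v]

def branch_representations(h, branch_weights, max_norm):
    """Stage II (assumed form): each objective gets h plus a bounded deviation.

    Every branch is expressed relative to the same reference h, so no
    branch can drift further than max_norm away from it.
    """
    branches = {}
    for name, W in branch_weights.items():
        delta = clip_norm(matvec(W, h), max_norm)
        branches[name] = [a + b for a, b in zip(h, delta)]
    return branches

def combine(branches, mix):
    """Assumed combination rule: weighted average of branch vectors."""
    total = sum(mix.values())
    dim = len(next(iter(branches.values())))
    return [
        sum(mix[name] * branches[name][i] for name in branches) / total
        for i in range(dim)
    ]

# Toy inputs; the numbers carry no meaning beyond the demonstration.
x = [1.0, -0.5]
W_shared = [[0.5, 0.1], [-0.2, 0.3]]
branch_weights = {
    "helpful": [[2.0, 0.0], [0.0, 2.0]],
    "harmless": [[0.0, 1.0], [1.0, 0.0]],
    "honest": [[-1.0, 0.0], [0.0, 1.0]],
}

h = shared_representation(x, W_shared)                             # computed once
branches = branch_representations(h, branch_weights, max_norm=0.1)  # N = 3 pathways
fused = combine(branches, {"helpful": 1.0, "harmless": 1.0, "honest": 1.0})
```

In this sketch the shared vector `h` is computed a single time and reused by all three branches, and the clipping step is one simple way to keep each objective-specific update close to the shared reference. The method described in the abstract may constrain divergence differently.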

Gautam Siddharth Kashyap, Mark Dras, Usman Naseem • 2025

Related benchmarks

Task	Dataset	Result
Helpfulness alignment	HHH Alignment	Win Rate (WR) 98.6	44
LLM Alignment	Harmlessness	Win Rate (WR) 86.2	27
Honesty Alignment	HHH Alignment	Win Rate (WR) 86.1	20
