
When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified

About

Aligning Large Language Models (LLMs) with human values, that is, making them helpful, harmless, and honest (HHH), is important for safe deployment. Existing work aligns LLMs with Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) architectures, but both struggle in multi-objective settings: SFT causes interference between conflicting objectives, while MoEs suffer from miscalibrated routing. We term this failure mode Axis Collapse, marked by (1) disjoint feature spaces causing catastrophic forgetting, and (2) unreliable inference from misrouted experts. To resolve this, we propose AlignX, a two-stage framework. Stage 1 uses prompt-injected fine-tuning to extract axis-specific task features, mitigating catastrophic forgetting. Stage 2 deploys a MoCaE module that calibrates expert routing using fractal and natural geometry, improving inference reliability. AlignX achieves significant gains on Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty), with a +171.5% win rate, +110.1% in truthfulness-informativeness, and 4.3% fewer safety violations. It also reduces latency and memory usage by over 35% compared to prior MoEs. Results across four LLMs validate its generalizability.
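The abstract does not spell out how the MoCaE module calibrates routing, but the general idea of calibrated expert routing can be sketched: instead of gating experts on raw router logits, each expert's logit is first rescaled by a learned per-expert calibration factor (here a temperature, a common calibration choice), so over-confident experts do not dominate the mixture. The function names, logit values, and temperatures below are all illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def calibrated_route(expert_logits, temperatures):
    """Sketch of calibrated routing: rescale each expert's raw
    router logit by a per-expert temperature before the softmax
    gate. A temperature > 1 down-weights an over-confident expert.
    (Hypothetical helper; not the paper's actual MoCaE routine.)"""
    calibrated = np.asarray(expert_logits, dtype=float) / np.asarray(temperatures, dtype=float)
    return softmax(calibrated)

# Three axis-specific experts (helpful, harmless, honest) with
# made-up router logits; the first expert is over-confident, so it
# is assigned a higher calibration temperature.
weights = calibrated_route([2.0, 1.0, 0.5], temperatures=[2.0, 1.0, 1.0])
```

With calibration, the first expert's gate weight shrinks relative to the uncalibrated softmax of the same logits, which is the intuition behind routing reliability: no single misrouted expert monopolizes the mixture.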

Gautam Siddharth Kashyap, Mark Dras, Usman Naseem • 2026

Related benchmarks

Task                                       Dataset                                      Metric                         Result    Rank
LLM Alignment                              Alpaca, BeaverTails, and TruthfulQA (test)   Win Rate                       97.1      12
LLM Alignment                              Helpfulness                                  Truthfulness Index             0.891     7
LLM Alignment                              Harmlessness                                 WR                             87.85     7
LLM Alignment                              Honesty                                      Truthfulness Index             84        7
LLM Alignment                              Base Model Evaluation Set                    Win Rate                       79.93     6
Computational Efficiency                   LLaMA-2-7B (test)                            Total Time                     1.50e+3   4
Honesty-Helpfulness Alignment Evaluation   HoneSet (full)                               Win Rate                       92.1      4
