Too Helpful, Too Harmless, Too Honest or Just Right?

About

Large Language Models (LLMs) exhibit strong performance across a wide range of NLP tasks, yet aligning their outputs with the principles of Helpfulness, Harmlessness, and Honesty (HHH) remains a persistent challenge. Existing methods often optimize for individual alignment dimensions in isolation, leading to trade-offs and inconsistent behavior. While Mixture-of-Experts (MoE) architectures offer modularity, they suffer from poorly calibrated routing, limiting their effectiveness in alignment tasks. We propose TrinityX, a modular alignment framework that incorporates a Mixture of Calibrated Experts (MoCaE) within the Transformer architecture. TrinityX leverages separately trained experts for each HHH dimension, integrating their outputs through a calibrated, task-adaptive routing mechanism that combines expert signals into a unified, alignment-aware representation. Extensive experiments on three standard alignment benchmarks, Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty), demonstrate that TrinityX outperforms strong baselines, achieving relative improvements of 32.5% in win rate, 33.9% in safety score, and 28.4% in truthfulness. In addition, TrinityX reduces memory usage and inference latency by over 40% compared to prior MoE-based approaches. Ablation studies highlight the importance of calibrated routing, and cross-model evaluations confirm TrinityX's generalization across diverse LLM backbones.
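The abstract does not spell out how the calibrated routing is computed, so the following is only a minimal sketch of the general idea: three expert outputs are mixed with router weights whose sharpness is controlled by a calibration step. Temperature scaling is assumed here purely for illustration and is not confirmed as the TrinityX method; the function names, the toy expert vectors, and the router scores are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature flattens the weights."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def route(expert_outputs, router_logits, temperature=1.0):
    """Mix per-expert vectors into one vector using the calibrated weights."""
    weights = softmax(router_logits, temperature)
    dim = len(expert_outputs[0])
    combined = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return weights, combined


# Hypothetical toy outputs from the three HHH experts (not real model states).
experts = {
    "helpful": [1.0, 0.0, 0.0],
    "harmless": [0.0, 1.0, 0.0],
    "honest": [0.0, 0.0, 1.0],
}
router_logits = [2.0, 1.0, 0.5]  # hypothetical router scores for one input

# Assumed calibration: temperature > 1 softens an overconfident router.
weights, combined = route(list(experts.values()), router_logits, temperature=2.0)
```

In this sketch a temperature above 1 spreads weight more evenly across the three experts, while a temperature below 1 concentrates it on the top-scoring expert. In practice the temperature would be fitted on held-out data rather than chosen by hand.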

Gautam Siddharth Kashyap, Mark Dras, Usman Naseem • 2025

Related benchmarks

Task	Dataset	Result
Helpfulness alignment	HHH Alignment	Win Rate (WR) 96.7	44
LLM Alignment	Harmlessness	WR 81.5	27
LLM Alignment	Alpaca, BeaverTails, and TruthfulQA (test)	Win Rate 96.75	12
LLM Alignment	Honesty	Truthfulness Index 63.01	7
LLM Alignment	Helpfulness	Truthfulness Index 0.4065	7
LLM Alignment	Base Model Evaluation Set	Win Rate 36.75	6

