Too Helpful, Too Harmless, Too Honest or Just Right?

About

Large Language Models (LLMs) exhibit strong performance across a wide range of NLP tasks, yet aligning their outputs with the principles of Helpfulness, Harmlessness, and Honesty (HHH) remains a persistent challenge. Existing methods often optimize for individual alignment dimensions in isolation, leading to trade-offs and inconsistent behavior. While Mixture-of-Experts (MoE) architectures offer modularity, they suffer from poorly calibrated routing, limiting their effectiveness in alignment tasks. We propose TrinityX, a modular alignment framework that incorporates a Mixture of Calibrated Experts (MoCaE) within the Transformer architecture. TrinityX leverages separately trained experts for each HHH dimension, integrating their outputs through a calibrated, task-adaptive routing mechanism that combines expert signals into a unified, alignment-aware representation. Extensive experiments on three standard alignment benchmarks (Alpaca for Helpfulness, BeaverTails for Harmlessness, and TruthfulQA for Honesty) demonstrate that TrinityX outperforms strong baselines, achieving relative improvements of 32.5% in win rate, 33.9% in safety score, and 28.4% in truthfulness. In addition, TrinityX reduces memory usage and inference latency by over 40% compared to prior MoE-based approaches. Ablation studies highlight the importance of calibrated routing, and cross-model evaluations confirm TrinityX's generalization across diverse LLM backbones.
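
The abstract does not specify how the calibrated routing is computed, so the following is only a minimal illustrative sketch of the general idea of mixing three expert outputs with routing weights. It assumes temperature-scaled softmax gating as a stand-in for calibration; the function names, the temperature parameter, and the toy values are all hypothetical and are not taken from TrinityX.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a temperature above 1 flattens the weights."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def route(expert_outputs, router_logits, temperature=1.0):
    """Combine per-expert vectors into one vector using routing weights.

    Assumption: calibration is approximated by a single temperature;
    the actual MoCaE mechanism may differ.
    """
    weights = softmax(router_logits, temperature)
    dim = len(expert_outputs[0])
    mixed = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return weights, mixed

# Hypothetical toy values: one 2-d output per expert
# (helpfulness, harmlessness, honesty) and made-up router logits.
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
router_logits = [2.0, 1.0, 0.0]

sharp_weights, sharp_mix = route(expert_outputs, router_logits, temperature=1.0)
soft_weights, soft_mix = route(expert_outputs, router_logits, temperature=4.0)
print("T=1:", sharp_weights, sharp_mix)
print("T=4:", soft_weights, soft_mix)
```

In this sketch a higher temperature spreads weight more evenly across the three experts, which is one simple way an overconfident router could be softened; whether TrinityX does anything similar is not stated in the abstract.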

Gautam Siddharth Kashyap, Mark Dras, Usman Naseem • 2025

Related benchmarks

Task           Dataset                                     Result                     Rank
LLM Alignment  Alpaca, BeaverTails, and TruthfulQA (test)  Win Rate 96.75             12
LLM Alignment  Harmlessness                                WR 81.5                    7
LLM Alignment  Honesty                                     Truthfulness Index 63.01   7
LLM Alignment  Helpfulness                                 Truthfulness Index 0.4065  7
LLM Alignment  Base Model Evaluation Set                   Win Rate 36.75             6
