H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs
About
The alignment of pre-trained LLMs continues to draw significant attention from both industry and academia, with the goal of ensuring responses that are helpful, harmless, and honest. However, identifying a point in the model's representation subspace that simultaneously satisfies all three properties remains challenging. H3Fusion addresses this challenge with a mixture-of-experts (MoE)-based fusion mechanism that models alignment as a controllable drift within the subspace, guided by a drift-regularization loss that balances the competing alignment dimensions. Furthermore, we formulate alignment as a dual objective that harnesses the distance between generated embeddings and alignment embeddings, and introduce a gating loss that channels the activations onto the contributing experts. Extensive evaluations on three benchmark datasets show that H3Fusion is more helpful, less harmful, and more honest: it outperforms each individually aligned model by 11.37%, and provides stronger robustness than state-of-the-art LLM ensemble approaches by 13.77% and model-merging approaches by 6.18%. Code is available at https://github.com/git-disl/h3fusion.
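The fusion idea described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the expert matrices, gating network, drift anchor (here the mean of the expert embeddings), and loss weights `lam_drift` / `lam_gate` are all hypothetical stand-ins. It shows the structural pieces the abstract names: a gate producing a convex combination of three aligned experts, a drift regularizer on the fused embedding, and a gating loss concentrating activations on the contributing experts.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Three "aligned experts" (helpful / harmless / honest), each mapping a
# hidden state to an embedding. Random matrices stand in for real experts.
d = 8
experts = [rng.standard_normal((d, d)) for _ in range(3)]
gate_W = rng.standard_normal((3, d))  # hypothetical gating network

def fuse(h, lam_drift=0.1, lam_gate=0.01):
    outs = np.stack([W @ h for W in experts])  # (3, d) expert embeddings
    g = softmax(gate_W @ h)                    # (3,) gate weights, sum to 1
    fused = g @ outs                           # convex combination of experts

    # Drift regularizer: penalize how far the fused embedding drifts from
    # an anchor (here, the mean expert embedding stands in for the paper's
    # alignment embeddings).
    anchor = outs.mean(axis=0)
    drift_loss = np.sum((fused - anchor) ** 2)

    # Gating loss: negative entropy of the gate weights pushes activation
    # mass onto a few contributing experts.
    gate_loss = np.sum(g * np.log(g + 1e-9))

    reg = lam_drift * drift_loss + lam_gate * gate_loss
    return fused, g, reg

h = rng.standard_normal(d)
fused, g, reg = fuse(h)
```

In training, `reg` would be added to the usual task loss so that gradient updates trade off output quality against the drift and gating terms.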
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| LLM Alignment | Alpaca, BeaverTails, and TruthfulQA (test) | Win Rate | 80 | 12 |
| LLM Alignment | Harmlessness | WR | 59.86 | 7 |
| LLM Alignment | Honesty | Truthfulness Index | 41.1 | 7 |
| LLM Alignment | Helpfulness | Truthfulness Index | 0.2689 | 7 |
| LLM Alignment | Base Model Evaluation Set | Win Rate | 13.79 | 6 |
| Computational Efficiency | LLaMA-2-7B (test) | Total Time | 7.26e+3 | 4 |