H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs
About
The alignment of pre-trained LLMs continues to draw significant attention from both industry and academia, with the goal of ensuring responses that are helpful, harmless, and honest. However, identifying a point in the model's representation subspace that simultaneously satisfies all three properties remains challenging. H3Fusion addresses this challenge with a mixture-of-experts (MoE)-based fusion mechanism that models alignment as a controllable drift within the subspace, guided by a drift-regularization loss that balances the competing alignment dimensions. Furthermore, we formulate alignment as a dual objective that harnesses the distance between generated embeddings and alignment embeddings, and introduce a gating loss that channels the activations onto the contributing experts. Extensive evaluations on three benchmark datasets show that H3Fusion is more helpful, less harmful, and more honest: it outperforms each individually aligned model by 11.37%, and provides stronger robustness than state-of-the-art LLM ensemble approaches by 13.77% and model-merging approaches by 6.18%. Code is available at https://github.com/git-disl/h3fusion.
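The fusion idea described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the expert matrices, gating network, drift anchor (here the mean of the expert embeddings), and loss weights `lam_drift` / `lam_gate` are all hypothetical stand-ins. It shows the structural pieces the abstract names: a gate producing a convex combination of three aligned experts, a drift regularizer on the fused embedding, and a gating loss concentrating activations on the contributing experts.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Three "aligned experts" (helpful / harmless / honest), each mapping a
# hidden state to an embedding. Random matrices stand in for real experts.
d = 8
experts = [rng.standard_normal((d, d)) for _ in range(3)]
gate_W = rng.standard_normal((3, d))  # hypothetical gating network

def fuse(h, lam_drift=0.1, lam_gate=0.01):
    outs = np.stack([W @ h for W in experts])  # (3, d) expert embeddings
    g = softmax(gate_W @ h)                    # (3,) gate weights, sum to 1
    fused = g @ outs                           # convex combination of experts

    # Drift regularizer: penalize how far the fused embedding drifts from
    # an anchor (here, the mean expert embedding stands in for the paper's
    # alignment embeddings).
    anchor = outs.mean(axis=0)
    drift_loss = np.sum((fused - anchor) ** 2)

    # Gating loss: negative entropy of the gate weights pushes activation
    # mass onto a few contributing experts.
    gate_loss = np.sum(g * np.log(g + 1e-9))

    reg = lam_drift * drift_loss + lam_gate * gate_loss
    return fused, g, reg

h = rng.standard_normal(d)
fused, g, reg = fuse(h)
```

In training, `reg` would be added to the usual task loss so that gradient updates trade off output quality against the drift and gating terms.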
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| LLM Alignment | Alpaca, BeaverTails, and TruthfulQA (test) | Win Rate | 80 | 12 |
| LLM Alignment | Harmlessness | WR | 59.86 | 7 |
| LLM Alignment | Honesty | Truthfulness Index | 41.1 | 7 |
| LLM Alignment | Helpfulness | Truthfulness Index | 0.2689 | 7 |
| LLM Alignment | Base Model Evaluation Set | Win Rate | 13.79 | 6 |
| Computational Efficiency | LLaMA-2-7B (test) | Total Time | 7.26e+3 | 4 |