
MH-MoE: Multi-Head Mixture-of-Experts

About

Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by using the multi-head mechanism to collectively attend to information from various representation spaces within different experts. In this paper, we present a novel implementation of MH-MoE that maintains both FLOPs and parameter parity with sparse Mixture of Experts models. Experimental results on language models show that the new implementation yields quality improvements over both vanilla MoE and fine-grained MoE models. Additionally, our experiments demonstrate that MH-MoE is compatible with 1-bit Large Language Models (LLMs) such as BitNet.
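The abstract does not spell out the layer structure, so the following is only a rough sketch of the general multi-head routing idea: a token is projected, split into sub-tokens ("heads"), each sub-token is routed to its own top-k experts, and the expert outputs are concatenated and merged. All names here (`mh_moe_forward`, `head_proj`, `merge_proj`, `gate`) are hypothetical, each expert is reduced to a single linear map for brevity, and the code makes no attempt to reproduce the FLOPs- and parameter-matched implementation the paper presents.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random matrix, used only to give the sketch some weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def mh_moe_forward(token, head_proj, merge_proj, gate, experts, num_heads, top_k=1):
    """Illustrative multi-head MoE forward pass for a single token.

    Assumption: experts are plain linear maps on sub-tokens; a real
    expert would be a feed-forward network.
    """
    sub_dim = len(token) // num_heads
    projected = matvec(head_proj, token)  # hypothetical "head" projection
    sub_tokens = [projected[i * sub_dim:(i + 1) * sub_dim] for i in range(num_heads)]

    merged_input, routing = [], []
    for sub in sub_tokens:
        scores = matvec(gate, sub)  # one routing logit per expert
        chosen = sorted(range(len(experts)), key=lambda e: scores[e], reverse=True)[:top_k]
        # Softmax over the selected experts only.
        peak = max(scores[e] for e in chosen)
        weights = [math.exp(scores[e] - peak) for e in chosen]
        total = sum(weights)
        out = [0.0] * sub_dim
        for e, w in zip(chosen, weights):
            y = matvec(experts[e], sub)
            out = [o + (w / total) * yi for o, yi in zip(out, y)]
        merged_input.extend(out)  # concatenate sub-token outputs
        routing.append(chosen)
    return matvec(merge_proj, merged_input), routing  # hypothetical "merge" projection


rng = random.Random(0)
d_model, num_heads, num_experts = 8, 4, 3
sub_dim = d_model // num_heads
token = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
head_proj = rand_matrix(d_model, d_model, rng)
merge_proj = rand_matrix(d_model, d_model, rng)
gate = rand_matrix(num_experts, sub_dim, rng)
experts = [rand_matrix(sub_dim, sub_dim, rng) for _ in range(num_experts)]

output, routing = mh_moe_forward(token, head_proj, merge_proj, gate, experts, num_heads, top_k=2)
```

Here `routing` holds one list of expert indices per sub-token, so a single token can draw on several experts at once, which is the property the abstract attributes to the multi-head mechanism.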

Shaohan Huang, Xun Wu, Shuming Ma, Furu Wei • 2024

Related benchmarks

Task                       Dataset     Result                   Rank
Visual Question Answering  VizWiz      Accuracy 77              1820
Visual Question Answering  GQA         Accuracy 75.3            1425
Visual Question Answering  ScienceQA   Accuracy 77.4            446
Visual Question Answering  VQA v2      Accuracy 83.6            333
Multimodal Evaluation      MM-Vet      --                       196
Multimodal Evaluation      MMStar      Accuracy 39.9            139
Visual Question Answering  MRAG-Bench  Overall Accuracy 64.85   14
