Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts
About
Mixture-of-Experts (MoE) architectures scale large language models efficiently by employing a parametric "router" to dispatch tokens to a sparse subset of experts. Typically, this router is trained once and then frozen, rendering routing decisions brittle under distribution shifts. We address this limitation by introducing kNN-MoE, a retrieval-augmented routing framework that reuses optimal expert assignments from a memory of similar past cases. This memory is constructed offline by directly optimizing token-wise routing logits to maximize the likelihood on a reference set. Crucially, we use the aggregate similarity of retrieved neighbors as a confidence-driven mixing coefficient, thus allowing the method to fall back to the frozen router when no relevant cases are found. Experiments show kNN-MoE outperforms zero-shot baselines and rivals computationally expensive supervised fine-tuning.
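The retrieval-augmented routing described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: all names (`knn_moe_route`, `memory_keys`, `memory_logits`, the temperature `tau`) are assumptions, and the confidence coefficient is taken here to be the mean top-k cosine similarity, which is one plausible reading of "aggregate similarity of retrieved neighbors".

```python
import numpy as np

def knn_moe_route(h, router_w, memory_keys, memory_logits, k=4, tau=1.0):
    """Blend frozen-router logits with logits retrieved from a kNN memory.

    h: (d,) token hidden state; router_w: (num_experts, d) frozen router.
    memory_keys: (N, d) stored hidden states; memory_logits: (N, num_experts)
    routing logits optimized offline on a reference set.
    Names and signatures are illustrative, not the paper's actual API.
    """
    # Frozen parametric router's logits for this token.
    router_logits = router_w @ h

    # Cosine similarity to all memory entries; keep the top-k neighbors.
    sims = memory_keys @ h / (
        np.linalg.norm(memory_keys, axis=1) * np.linalg.norm(h) + 1e-8)
    idx = np.argsort(sims)[-k:]
    top_sims = sims[idx]

    # Similarity-weighted average of the retrieved routing logits.
    w = np.exp(top_sims / tau)
    w /= w.sum()
    retrieved_logits = w @ memory_logits[idx]

    # Aggregate neighbor similarity acts as a confidence-driven mixing
    # coefficient: low similarity falls back to the frozen router.
    lam = np.clip(top_sims.mean(), 0.0, 1.0)
    return lam * retrieved_logits + (1.0 - lam) * router_logits
```

When no relevant neighbors exist (similarities near zero), `lam` vanishes and the output reduces to the frozen router's logits, giving the fallback behavior the abstract describes.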
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Understanding | MMLU | Accuracy | 47.81 | 756 |
| Question Answering | GPQA | Accuracy | 29.8 | 258 |
| Medical Question Answering | MedMCQA (test) | Accuracy | 66.65 | 134 |
| Question Answering | MedQA-USMLE (test) | Accuracy | 76.7 | 101 |
| Question Answering | GPQA (test) | Accuracy | 45.45 | 55 |
| Question Answering | MMLU (test) | Accuracy | 78.86 | 15 |
| Question Answering | SuperGPQA (test) | Accuracy | 35.15 | 15 |
| Language Understanding | USMLE | Accuracy | 35.04 | 3 |