RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

About

The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet production systems dispatch from batch size alone, leaving 10-70% of kernel throughput unrealized. We present RaMP, a routing-aware dispatch framework. A performance-region analysis derives, from hardware constants alone, when each optimization helps, correctly predicting all 8 tested architectures, including 3 unseen. A four-parameter wave cost model selects the fastest configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search, fitted from just 10-24 minutes of one-time profiling per model. Because the model depends only on CTA grid geometry, it is kernel-agnostic: applied to Alpha-MoE, it delivers 1.14x with no source modification. Paired with a co-designed CuTe DSL kernel exposing 134-268 polymorphic configurations, RaMP delivers 1.22x kernel speedup over static dispatch and 1.30x end-to-end speedup in vLLM serving over Triton, 1.41x over DeepGEMM, and 1.13x over FlashInfer CUTLASS.
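The abstract does not give the form of the wave cost model, so the following is only a minimal sketch of the general idea, under stated assumptions: a configuration is reduced to a CTA tile shape, the CTA grid size is derived from the runtime expert histogram, and cost grows with the number of waves needed to schedule that grid on the available SMs. The names `TileConfig`, `cta_count`, `predicted_cost`, and `select_config`, the four parameters, and all numeric values are hypothetical and are not taken from RaMP.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TileConfig:
    """Hypothetical kernel configuration: output tile shape handled by one CTA."""
    tile_m: int  # token rows per CTA
    tile_n: int  # output columns per CTA


def cta_count(histogram, n_cols, cfg):
    """CTAs launched for a grouped GEMM: row tiles per active expert times column tiles."""
    col_tiles = math.ceil(n_cols / cfg.tile_n)
    row_tiles = sum(math.ceil(tokens / cfg.tile_m) for tokens in histogram if tokens > 0)
    return row_tiles * col_tiles


def predicted_cost(histogram, n_cols, cfg, num_sms, params):
    """Illustrative four-parameter wave model (assumed form, not the paper's).

    params = (launch, per_wave, per_tile_elem, per_cta), all in arbitrary time units.
    """
    launch, per_wave, per_tile_elem, per_cta = params
    ctas = cta_count(histogram, n_cols, cfg)
    waves = math.ceil(ctas / num_sms)  # wave quantization over the SMs
    wave_time = per_wave + per_tile_elem * cfg.tile_m * cfg.tile_n
    return launch + waves * wave_time + per_cta * ctas


def select_config(histogram, n_cols, configs, num_sms, params):
    """Pick the configuration with the lowest predicted cost for this histogram."""
    return min(configs, key=lambda c: predicted_cost(histogram, n_cols, c, num_sms, params))


# Made-up values for illustration only.
configs = [TileConfig(16, 128), TileConfig(64, 128)]
params = (5.0, 10.0, 0.001, 0.5)
uniform = [16] * 8            # 128 tokens spread evenly over 8 experts
skewed = [128] + [0] * 7      # the same 128 tokens routed to a single expert

best_uniform = select_config(uniform, 128, configs, num_sms=4, params=params)
best_skewed = select_config(skewed, 128, configs, num_sms=4, params=params)
```

With these made-up parameters, the sketch picks the smaller tile for the uniform histogram and the larger tile for the skewed one, even though both batches contain 128 tokens. A dispatcher keyed on batch size alone cannot make that distinction.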

Vyom Sharma, Debajyoti Datta • 2026

Related benchmarks

Task	Dataset	Result
Kernel Throughput Evaluation	MoE Models (OLMoE, Qwen3, DSv3, Mixtral) beta=0.5	Latency: 67	12
End-to-end LLM Inference Serving	Long-context 1024-token input, 32-token output	TPOT Speedup vs DeepGEMM: 1.48	3
End-to-end LLM Inference Serving	ShareGPT	TPOT Speedup vs DeepGEMM: 1.5	2
