RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

About

The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet production systems dispatch from batch size alone, leaving 10-70% of kernel throughput unrealized. We present RaMP, a routing-aware dispatch framework. A performance-region analysis derives, from hardware constants alone, when each optimization helps, correctly predicting all 8 tested architectures, including 3 unseen. A four-parameter wave cost model selects the fastest configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search, fitted from just 10-24 minutes of one-time profiling per model. Because the model depends only on CTA grid geometry, it is kernel-agnostic: applied to Alpha-MoE, it delivers 1.14x with no source modification. Paired with a co-designed CuTe DSL kernel exposing 134-268 polymorphic configurations, RaMP delivers 1.22x kernel speedup over static dispatch and 1.30x end-to-end speedup in vLLM serving over Triton, 1.41x over DeepGEMM, and 1.13x over FlashInfer CUTLASS.
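The abstract does not state the form of the four-parameter wave cost model, so the following is only an illustrative sketch of how dispatch from the runtime expert histogram and CTA grid geometry might look. Everything in it is an assumption made for illustration: the `Config` knobs (`tile_m`, `tile_n`), the four cost terms (launch, per-wave, per-CTA, padding), and the function names. It is not RaMP's actual model or API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical kernel configuration: CTA tile shape only."""
    tile_m: int  # tokens (rows) covered by one CTA tile
    tile_n: int  # output columns covered by one CTA tile


def cta_count(histogram, cfg, n_cols):
    """CTAs launched for a per-expert token histogram.

    Each expert with t > 0 tokens needs ceil(t / tile_m) row tiles,
    each repeated across ceil(n_cols / tile_n) column tiles.
    """
    col_tiles = math.ceil(n_cols / cfg.tile_n)
    return sum(math.ceil(t / cfg.tile_m) * col_tiles for t in histogram if t > 0)


def wave_count(histogram, cfg, n_cols, num_sms):
    """Waves needed when at most num_sms CTAs run concurrently (assumed 1 CTA per SM)."""
    return math.ceil(cta_count(histogram, cfg, n_cols) / num_sms)


def wave_cost(histogram, cfg, n_cols, num_sms, params):
    """Assumed four-parameter cost: launch + per-wave + per-CTA + padding terms."""
    launch, per_wave, per_cta, per_pad = params
    ctas = cta_count(histogram, cfg, n_cols)
    waves = math.ceil(ctas / num_sms)
    # Rows computed but wasted because token counts are rounded up to tile_m.
    padded = sum(math.ceil(t / cfg.tile_m) * cfg.tile_m - t for t in histogram if t > 0)
    tile_area = cfg.tile_m * cfg.tile_n
    return launch + per_wave * waves * tile_area + per_cta * ctas + per_pad * padded


def select_config(histogram, configs, n_cols, num_sms, params):
    """Pick the configuration with the lowest predicted cost for this histogram."""
    return min(configs, key=lambda c: wave_cost(histogram, c, n_cols, num_sms, params))


def regret(chosen_time, best_time):
    """Relative slowdown of the chosen configuration versus the exhaustive-search best."""
    return (chosen_time - best_time) / best_time
```

Under these assumptions, a skewed histogram and a uniform one with the same total token count can produce different CTA and wave counts, which is why a dispatcher keyed on batch size alone could choose a slower configuration. In practice the parameters would be fitted from profiling data rather than set by hand.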

Vyom Sharma, Debajyoti Datta • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Kernel Throughput Evaluation | MoE Models (OLMoE, Qwen3, DSv3, Mixtral), beta=0.5 | Latency: 67 | 12 |
| End-to-end LLM Inference Serving | Long-context, 1024-token input, 32-token output | TPOT Speedup vs DeepGEMM: 1.48 | 3 |
| End-to-end LLM Inference Serving | ShareGPT | TPOT Speedup vs DeepGEMM: 1.5 | 2 |
