MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
About
Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We introduce MCAP (Monte Carlo Activation Profiling), a load-time per-layer importance estimator that enables dynamic precision and memory placement decisions on the target device. MCAP produces a lightweight per-layer signal that drives both precision dispatch (W4A8 vs. W4A16) and residency tier (GPU, RAM, SSD), allowing a single set of weights to operate across diverse memory budgets. Our system, NVE, achieves 1.5-1.8x higher decode throughput than llama.cpp Q4_0 on NVIDIA T4 and, without modifying weights, enables models to run in memory regimes that were previously infeasible.
Anurita Das • 2026
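The abstract describes a per-layer importance signal that selects both a precision mode (W4A8 vs. W4A16) and a residency tier (GPU, RAM, SSD). The sketch below illustrates one way such a signal could drive those two decisions. It is not the authors' algorithm: the function `plan_layers`, the `precision_threshold` parameter, the greedy placement rule, and the assumption that importance scores are already available as floats are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class LayerPlan:
    index: int
    precision: str  # "W4A16" or "W4A8"
    tier: str       # "GPU", "RAM", or "SSD"


def plan_layers(importance, layer_bytes, gpu_budget, ram_budget,
                precision_threshold=0.5):
    """Map per-layer importance scores to a precision mode and residency tier.

    Assumed (hypothetical) policy: layers at or above the threshold keep
    16-bit activations, and layers are placed greedily in descending order
    of importance, filling GPU first, then RAM, with SSD as the fallback.
    """
    order = sorted(range(len(importance)),
                   key=lambda i: importance[i], reverse=True)
    gpu_left, ram_left = gpu_budget, ram_budget
    plans = {}
    for i in order:
        precision = "W4A16" if importance[i] >= precision_threshold else "W4A8"
        if layer_bytes[i] <= gpu_left:
            tier = "GPU"
            gpu_left -= layer_bytes[i]
        elif layer_bytes[i] <= ram_left:
            tier = "RAM"
            ram_left -= layer_bytes[i]
        else:
            tier = "SSD"
        plans[i] = LayerPlan(i, precision, tier)
    # Return plans in the original layer order.
    return [plans[i] for i in range(len(importance))]


# Toy example: four equally sized layers, room for two on GPU and one in RAM.
plans = plan_layers(
    importance=[0.9, 0.2, 0.6, 0.1],
    layer_bytes=[100, 100, 100, 100],
    gpu_budget=200,
    ram_budget=100,
)
for p in plans:
    print(p.index, p.precision, p.tier)
```

Because the weights themselves are untouched, rerunning a planner like this with different budgets would yield a different placement for the same model, which is the property the abstract highlights.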
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText-2 | Perplexity (PPL) | 14.01 | 2320 |
| Common Sense Reasoning | HellaSwag | Accuracy (acc_n) | 65 | 47 |
| Generative tasks | 8-task generative suite | Accuracy | 100 | 21 |
| Decode Throughput | BenchRandom | Decode Throughput (tok/s) | 269.1 | 9 |
| Commonsense Reasoning | HellaSwag n=50 (val) | Accuracy | 54 | 5 |
| Language Modeling | WikiText-2 50 seq × 256 tok (test) | PPL | 17.51 | 5 |