MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

About

Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We introduce MCAP (Monte Carlo Activation Profiling), a load-time per-layer importance estimator that enables dynamic precision and memory placement decisions on the target device. MCAP produces a lightweight per-layer signal that drives both precision dispatch (W4A8 vs. W4A16) and residency tier (GPU, RAM, SSD), allowing a single set of weights to operate across diverse memory budgets. Our system, NVE, achieves 1.5-1.8x higher decode throughput than llama.cpp Q4_0 on an NVIDIA T4 and, without modifying weights, enables models to run under memory budgets that were previously infeasible.
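As a rough illustration of how a per-layer signal could drive both decisions, the sketch below maps importance scores to a precision mode and a residency tier under fixed memory budgets. It is not the NVE implementation: the function name `plan_layers`, the `LayerPlan` type, the threshold of 0.5, the choice to give higher-scoring layers W4A16, and the greedy fill order are all assumptions made for the example, and the importance scores are taken as given rather than estimated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayerPlan:
    index: int
    precision: str  # "W4A8" or "W4A16"
    tier: str       # "GPU", "RAM", or "SSD"


def plan_layers(
    importance: List[float],
    layer_bytes: List[int],
    gpu_budget: int,
    ram_budget: int,
    precision_threshold: float = 0.5,  # hypothetical cutoff, not from the paper
) -> List[LayerPlan]:
    """Assign a precision mode and a residency tier to each layer.

    Assumed policy (illustrative only): layers scoring at or above the
    threshold keep 16-bit activations, and layers are placed in the
    fastest tier with room, most important first.
    """
    n = len(importance)
    precisions = [
        "W4A16" if score >= precision_threshold else "W4A8"
        for score in importance
    ]

    # Visit layers from most to least important.
    order = sorted(range(n), key=lambda i: importance[i], reverse=True)

    tiers = ["SSD"] * n  # anything that fits nowhere else stays on SSD
    gpu_left, ram_left = gpu_budget, ram_budget
    for i in order:
        size = layer_bytes[i]
        if size <= gpu_left:
            tiers[i] = "GPU"
            gpu_left -= size
        elif size <= ram_left:
            tiers[i] = "RAM"
            ram_left -= size

    return [LayerPlan(i, precisions[i], tiers[i]) for i in range(n)]


if __name__ == "__main__":
    # Four equally sized layers; the GPU holds two and RAM holds one.
    for plan in plan_layers([0.9, 0.2, 0.6, 0.1], [100] * 4, 200, 100):
        print(plan)
```

Changing only `gpu_budget` and `ram_budget` yields a different placement for the same scores and the same weights, which is the property the abstract describes.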

Anurita Das • 2026

Related benchmarks

Task	Dataset	Result
Language Modeling	WikiText-2	Perplexity (PPL): 14.01	2862
Common Sense Reasoning	HellaSwag	Accuracy (acc_n): 65	47
Generative tasks	8-task generative suite	Accuracy: 100	21
Decode Throughput	BenchRandom	Decode Throughput (tok/s): 269.1	9
Commonsense Reasoning	HellaSwag n=50 (val)	Accuracy: 54	5
Language Modeling	WikiText-2 50 seq × 256 tok (test)	PPL: 17.51	5
