You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations
About
Many LLM applications require only narrow capabilities, yet standard post-training quantization (PTQ) methods allocate precision without considering the target task. This can waste bits on layers that are less relevant to the task signal while over-compressing layers that are critical for downstream behavior. We propose Task-Aware Quantization (TAQ), a training-free, weight-only mixed-precision PTQ framework that uses a small set of unlabeled task calibration prompts to allocate higher precision to task-relevant transformer layers under a fixed bit budget. TAQ estimates layer importance from hidden representations and output sensitivity, and we instantiate it with three scoring rules: TAQ-IS, based on activation information and stability; TAQ-KL, based on output-distribution sensitivity under a quantization-noise proxy; and TAQ-O, a label-informed oracle diagnostic for analyzing layer sensitivity. Across several benchmarks, TAQ outperforms task-agnostic baselines in most settings, with especially strong gains in the accuracy-memory ratio. We further validate that these gains translate to real deployment behavior through hardware throughput and latency measurements, and analyze calibration robustness and residual-stream error propagation. Overall, TAQ turns mixed-precision PTQ from a model-centric compression step into a task-conditioned precision-allocation problem. A reference implementation is available at https://anonymous.4open.science/r/TAQ-9217/README.md.
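The abstract does not say how per-layer importance scores are turned into bit-widths, so the following is only a minimal sketch of one plausible scheme and not the authors' algorithm. It assumes a greedy rule: every layer starts at the lowest precision, and layers are upgraded in order of decreasing importance for as long as the total bit budget allows. The function name `allocate_bits`, the candidate bit-widths `(2, 4, 8)`, and the example scores are all hypothetical.

```python
from typing import List, Sequence


def allocate_bits(
    scores: Sequence[float],
    sizes: Sequence[int],
    avg_bits: float,
    choices: Sequence[int] = (2, 4, 8),  # hypothetical candidate bit-widths
) -> List[int]:
    """Greedy mixed-precision allocation under a fixed bit budget.

    scores:   per-layer importance (higher = more task-relevant); assumed given.
    sizes:    per-layer parameter counts.
    avg_bits: target average bits per weight across the whole model.
    """
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")

    levels = sorted(choices)
    budget = avg_bits * sum(sizes)   # total number of bits we may spend
    level_idx = [0] * len(scores)    # every layer starts at the lowest precision
    used = levels[0] * sum(sizes)
    if used > budget:
        raise ValueError("budget is below the lowest available precision")

    # Visit layers from most to least important and upgrade each one
    # as far as the remaining budget permits.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for i in order:
        while level_idx[i] + 1 < len(levels):
            step = levels[level_idx[i] + 1] - levels[level_idx[i]]
            cost = step * sizes[i]
            if used + cost > budget:
                break
            used += cost
            level_idx[i] += 1

    return [levels[j] for j in level_idx]


# Four equally sized layers, 4-bit average budget, made-up importance scores.
bits = allocate_bits(scores=[0.9, 0.1, 0.5, 0.3], sizes=[100] * 4, avg_bits=4.0)
print(bits)  # [8, 2, 4, 2]
```

In this toy run the most important layer is promoted to 8 bits, the second most important to 4 bits, and the remaining two stay at 2 bits, which exactly meets the 4-bit average. A greedy rule like this is simple but not optimal in general; a knapsack-style or integer-programming formulation would be a natural alternative.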
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Math Reasoning | MMLU-Pro | EM Score | 42.48 | 28 |
| Knowledge Retrieval | TriviaQA | Exact Match (EM) | 61.04 | 28 |
| Code Understanding | CodeMMLU | Exact Match (EM) | 51.03 | 28 |
| Large Language Model Inference | Qwen2.5-7B (test) | Throughput | 37.29 | 7 |