You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations

About

Many LLM applications require only narrow capabilities, yet standard post-training quantization (PTQ) methods allocate precision without considering the target task. This can waste bits on layers that are less relevant to the task signal while over-compressing layers that are critical for downstream behavior. We propose Task-Aware Quantization (TAQ), a training-free, weight-only mixed-precision PTQ framework that uses a small set of unlabeled task calibration prompts to allocate higher precision to task-relevant transformer layers under a fixed bit budget. TAQ estimates layer importance from hidden representations and output sensitivity, and we instantiate it with three scoring rules: TAQ-IS, based on activation information and stability; TAQ-KL, based on output-distribution sensitivity under a quantization-noise proxy; and TAQ-O, a label-informed oracle diagnostic for analyzing layer sensitivity. Across several benchmarks, TAQ outperforms task-agnostic baselines in most settings, with especially strong gains in the accuracy-memory ratio. We further validate that these gains translate to real deployment behavior through hardware throughput and latency measurements, and analyze calibration robustness and residual-stream error propagation. Overall, TAQ turns mixed-precision PTQ from a model-centric compression step into a task-conditioned precision-allocation problem. A reference implementation is available at https://anonymous.4open.science/r/TAQ-9217/README.md.
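
To make the idea of budgeted precision allocation concrete, here is a minimal sketch in Python. It is not the TAQ implementation: the function name `allocate_bits`, the two-level bit choice (`low`/`high`), and the greedy rule are all assumptions made for illustration, and the importance scores are taken as given rather than computed by TAQ-IS, TAQ-KL, or TAQ-O.

```python
def allocate_bits(scores, sizes, avg_bits=4.0, low=2, high=8):
    """Greedy two-level bit allocation under an average-bit budget.

    scores: hypothetical per-layer importance scores (higher = more important).
    sizes:  number of weights in each layer.
    Returns a list with the bit-width chosen for each layer.
    """
    n = len(scores)
    bits = [low] * n                      # start every layer at low precision
    budget = avg_bits * sum(sizes)        # total bits allowed for all weights
    used = low * sum(sizes)               # bits consumed by the low baseline
    # Visit layers from most to least important and upgrade while budget allows.
    for i in sorted(range(n), key=lambda k: scores[k], reverse=True):
        cost = (high - low) * sizes[i]    # extra bits needed to upgrade layer i
        if used + cost <= budget:
            bits[i] = high
            used += cost
    return bits


def average_bits(bits, sizes):
    """Weighted average bit-width over all weights."""
    return sum(b * s for b, s in zip(bits, sizes)) / sum(sizes)


# Toy example: four equally sized layers with made-up importance scores.
example_scores = [0.9, 0.1, 0.5, 0.2]
example_sizes = [100, 100, 100, 100]
example_bits = allocate_bits(example_scores, example_sizes)
```

In this toy setting only the highest-scoring layer can be upgraded before the 4-bit average budget is exhausted; a finer set of bit-widths or a different solver would distribute the budget differently.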

Amit LeVi, Raz Lapid, Rom Himelstein, Chaim Baskin, Ravid Shwartz Ziv, Avi Mendelson • 2025

Related benchmarks

Task	Dataset	Result
Math Reasoning	MMLU-Pro	EM Score 42.48	28
Knowledge retrieval	TriviaQA	Exact Match (EM) 61.04	28
Code Understanding	CodeMMLU	Exact Match (EM) 51.03	28
Large Language Model Inference	Qwen2.5-7B (test)	Throughput 37.29	7
