You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations

About

Many LLM applications require only narrow capabilities, yet standard post-training quantization (PTQ) methods allocate precision without considering the target task. This can waste bits on layers that are less relevant to the task signal while over-compressing layers that are critical for downstream behavior. We propose Task-Aware Quantization (TAQ), a training-free, weight-only mixed-precision PTQ framework that uses a small set of unlabeled task calibration prompts to allocate higher precision to task-relevant transformer layers under a fixed bit budget. TAQ estimates layer importance from hidden representations and output sensitivity, and we instantiate it with three scoring rules: TAQ-IS, based on activation information and stability; TAQ-KL, based on output-distribution sensitivity under a quantization-noise proxy; and TAQ-O, a label-informed oracle diagnostic for analyzing layer sensitivity. Across several benchmarks, TAQ outperforms task-agnostic baselines in most settings, with especially strong gains in the accuracy-memory ratio. We further validate that these gains translate to real deployment behavior through hardware throughput and latency measurements, and analyze calibration robustness and residual-stream error propagation. Overall, TAQ turns mixed-precision PTQ from a model-centric compression step into a task-conditioned precision-allocation problem. A reference implementation is available at https://anonymous.4open.science/r/TAQ-9217/README.md.

Amit LeVi, Raz Lapid, Rom Himelstein, Chaim Baskin, Ravid Shwartz Ziv, Avi Mendelson • 2025
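
The abstract describes allocating higher precision to task-relevant layers under a fixed bit budget. As a rough illustration only, the sketch below shows one simple way such an allocation could work: a greedy pass that upgrades the highest-scoring layers first. This is not the authors' algorithm; the function name `allocate_bits`, the two bit-widths (4 and 8), the greedy rule, and the example scores are all assumptions made for illustration, and the importance scores are taken as given rather than computed by TAQ-IS or TAQ-KL.

```python
def allocate_bits(scores, param_counts, avg_bits_budget, low_bits=4, high_bits=8):
    """Hypothetical greedy mixed-precision allocation (illustrative sketch).

    scores          : per-layer importance scores (higher = more task-relevant)
    param_counts    : number of weights in each layer
    avg_bits_budget : target parameter-weighted average bit-width
    Returns a list with the bit-width assigned to each layer.
    """
    if len(scores) != len(param_counts):
        raise ValueError("scores and param_counts must have the same length")

    total_params = sum(param_counts)
    budget = avg_bits_budget * total_params   # total bits we may spend
    bits = [low_bits] * len(scores)           # start every layer at low precision
    used = low_bits * total_params
    if used > budget:
        raise ValueError("budget is below the all-low-precision baseline")

    # Visit layers from most to least important and upgrade each one
    # only if the extra bits still fit within the budget.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for i in order:
        extra = (high_bits - low_bits) * param_counts[i]
        if used + extra <= budget:
            bits[i] = high_bits
            used += extra
    return bits


if __name__ == "__main__":
    # Four equally sized layers with made-up importance scores.
    layer_scores = [0.1, 0.9, 0.5, 0.3]
    layer_sizes = [100, 100, 100, 100]
    print(allocate_bits(layer_scores, layer_sizes, avg_bits_budget=6))
```

With a 6-bit average budget over four equally sized layers, only the two highest-scoring layers can be upgraded to 8 bits; the others stay at 4 bits. A real allocator would likely support more than two bit-widths and a sensitivity-aware objective, which this sketch leaves out.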

Related benchmarks

Task                            Dataset            Metric            Result  Rank
Math Reasoning                  MMLU-Pro           EM Score          42.48   28
Knowledge retrieval             TriviaQA           Exact Match (EM)  61.04   28
Code Understanding              CodeMMLU           Exact Match (EM)  51.03   28
Large Language Model Inference  Qwen2.5-7B (test)  Throughput        37.29   7
