Enhancing Post-Training Quantization via Future Activation Awareness

About

Post-training quantization (PTQ) is a widely used method to compress large language models (LLMs) without fine-tuning. It typically sets quantization hyperparameters (e.g., scaling factors) based on current-layer activations. Although this method is efficient, it suffers from quantization bias and error accumulation, resulting in suboptimal and unstable quantization, especially when the calibration data is biased. To overcome these issues, we propose Future-Aware Quantization (FAQ), which leverages future-layer activations to guide quantization. This allows better identification and preservation of important weights, while reducing sensitivity to calibration noise. We further introduce a window-wise preview mechanism to softly aggregate multiple future-layer activations, mitigating over-reliance on any single layer. To avoid expensive greedy search, we use a pre-searched configuration to minimize overhead. Experiments show that FAQ consistently outperforms prior methods with negligible extra cost, requiring no backward passes, data reconstruction, or tuning, making it well-suited for edge deployment.
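The abstract contrasts PTQ that sets per-channel scaling factors from current-layer activations with FAQ, which softly aggregates activations from a window of future layers. The sketch below only illustrates that contrast and is not the authors' implementation. It assumes an AWQ-style scaling rule (input channel j is scaled by s_j = importance_j^alpha before round-to-nearest quantization, and 1/s_j is folded back in afterwards). It also assumes that future layers' per-channel statistics align with the current layer's input channels, as in a shared residual-stream hidden dimension. The names `channel_importance`, `window_weights`, `future_aware_importance`, and `scaled_quantize`, and the parameters `beta`, `decay`, and `alpha`, are hypothetical and chosen for illustration. Plain Python lists keep it dependency-free.

```python
# Minimal sketch, NOT the FAQ reference implementation; all names and defaults are hypothetical.


def channel_importance(acts):
    """Mean absolute activation per input channel; `acts` is a list of token vectors."""
    n = len(acts)
    return [sum(abs(tok[j]) for tok in acts) / n for j in range(len(acts[0]))]


def window_weights(k, decay=0.5):
    """Hypothetical soft weights over a preview window of k future layers.

    Nearer layers get larger weights (geometric decay); the weights sum to 1.
    """
    raw = [decay ** i for i in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


def future_aware_importance(current, future_stats, beta=0.5, decay=0.5):
    """Blend current-layer channel importance with a soft aggregate of future layers.

    `future_stats[i]` holds per-channel statistics from layer l+1+i and is ASSUMED
    to share the hidden dimension with `current` (e.g. the residual stream).
    """
    if not future_stats:
        return list(current)
    w = window_weights(len(future_stats), decay)
    future = [
        sum(w[i] * stats[j] for i, stats in enumerate(future_stats))
        for j in range(len(current))
    ]
    # beta controls how much the future window overrides the current layer.
    return [(1 - beta) * c + beta * f for c, f in zip(current, future)]


def quantize_row(row, bits=4):
    """Symmetric round-to-nearest quantization of one row; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in row)
    if max_abs == 0:
        return list(row)
    step = max_abs / qmax
    return [max(-qmax - 1, min(qmax, round(v / step))) * step for v in row]


def scaled_quantize(weight, importance, alpha=0.5, bits=4):
    """AWQ-style scaling (assumed rule): quantize W * s per row, then divide s back out.

    s_j = importance_j ** alpha enlarges important input channels before rounding,
    so they lose relatively less precision.
    """
    scales = [max(imp, 1e-8) ** alpha for imp in importance]
    result = []
    for row in weight:
        deq = quantize_row([w * s for w, s in zip(row, scales)], bits)
        result.append([d / s for d, s in zip(deq, scales)])
    return result


# Usage: a layer with 3 input channels and a 2-layer preview window.
acts_now = [[0.2, 1.0, 0.1], [0.4, -1.2, 0.3]]
acts_future = [[[0.3, 0.2, 2.0], [0.1, 0.4, 1.8]], [[0.2, 0.1, 1.5]]]
imp = future_aware_importance(
    channel_importance(acts_now),
    [channel_importance(a) for a in acts_future],
)
W = [[0.12, -0.40, 0.05], [0.30, 0.07, -0.02]]
W_q = scaled_quantize(W, imp)
print([round(v, 3) for v in imp])
```

In this toy example, channel 2 is weak in the current layer but strong in both previewed layers, so its blended importance rises and its weights receive a larger scale. A geometric-decay window is just one simple choice of soft aggregation, and fixed defaults for `beta`, `decay`, and `alpha` only loosely stand in for the paper's pre-searched configuration; the actual weighting may differ.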

Zheqi Lv, Zhenxuan Fan, Qi Tian, Wenqiao Zhang, Yueting Zhuang · 2026

Related benchmarks

Task                                     Dataset        Metric            Result  Rank
Language Modeling                        WikiText-2     Perplexity (PPL)  6.2191  2320
Commonsense Reasoning                    HellaSwag      Accuracy (%)      56.08   1896
Language Modeling                        C4             Perplexity (PPL)  7.6094  1565
Commonsense Reasoning                    WinoGrande     Accuracy (%)      68.19   1442
Question Answering                       ARC Challenge  Accuracy (%)      50.43   906
Question Answering                       ARC Easy       Accuracy (%)      80.56   597
Physical Interaction Question Answering  PIQA           Accuracy (%)      78.4    415
Boolean Question Answering               BoolQ          Accuracy (%)      85.29   350
