Safety-Preserving PTQ via Contrastive Alignment Loss

About

Post-Training Quantization (PTQ) has become the de facto standard for efficient LLM deployment, yet its optimization objective remains fundamentally incomplete. Standard PTQ methods minimize reconstruction error (e.g., MSE or KL divergence) without accounting for behavioral alignment, a critical property instilled through safety fine-tuning. We demonstrate that this objective mismatch introduces a systematic vulnerability: models can maintain low perplexity yet exhibit significant degradation in safety alignment, which shows that perplexity alone is an insufficient and often misleading proxy for deployment readiness. To address this, we propose Contrastive Alignment Quantization (CAQ), which extends the PTQ objective design space by integrating a Contrastive Alignment Loss (CAL). CAL introduces a principled push-pull mechanism that jointly optimizes distributional fidelity and behavioral alignment: it steers the quantized model toward its safe, instruction-tuned counterpart while pushing it away from the unaligned, pre-trained reference. CAQ requires no specialized safety datasets, relying solely on standard calibration data, and introduces negligible computational overhead over existing transformation-based PTQ pipelines. We show that CAQ enables robust 4-bit (W4A4) quantization across diverse model families (including LLaMA, Qwen, and Mistral), achieving superior safety alignment where state-of-the-art PTQ methods fail, without sacrificing general capabilities. Anonymized code is available in the supplementary material.
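
The abstract describes the push-pull mechanism only in words, so the following is a minimal sketch of what such an objective could look like, not the authors' formulation. It assumes the pull term is a KL divergence from the instruction-tuned model's next-token distribution to the quantized model's, and that the push term is a hinge on the KL divergence from the pre-trained model's distribution. The function names and the `push_weight` and `margin` parameters are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def contrastive_alignment_loss(quant_logits, aligned_logits, base_logits,
                               push_weight=0.1, margin=1.0):
    """Illustrative push-pull loss for one token position (assumed form).

    pull: keep the quantized model close to the instruction-tuned model.
    push: penalize the quantized model while it stays within `margin`
          (in KL) of the unaligned pre-trained model.
    """
    q = softmax(quant_logits)
    pull = kl_divergence(softmax(aligned_logits), q)
    push = kl_divergence(softmax(base_logits), q)
    # Hinge so the push term stops contributing once the gap exceeds the margin.
    return pull + push_weight * max(0.0, margin - push)


if __name__ == "__main__":
    aligned = [2.0, 0.5, -1.0]   # toy logits from the instruction-tuned model
    base = [-1.0, 0.5, 2.0]      # toy logits from the pre-trained model
    print(contrastive_alignment_loss(aligned, aligned, base))
    print(contrastive_alignment_loss(base, aligned, base))
```

In this toy setup the loss is lower when the quantized logits match the instruction-tuned model than when they match the pre-trained one. The hinge is one way to keep the push term bounded. Whether the paper uses a margin, a ratio, or another contrastive form is not stated in the abstract.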

Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, Nojun Kwak • 2025

Related benchmarks

Task	Dataset	Result
Language Modeling	Perplexity	Perplexity (PPL): 7.56	149
Safety Evaluation	SafetyBench	Safety: 66.8	26
Zero-shot Task Evaluation	11 Tasks zero-shot	0-shot Average: 66.37	26
