LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss

About

Large reasoning models (LRMs) reach competition-level math and coding accuracy via long autoregressive decoding, making per-token decoding cost a primary deployment concern. Weight quantization is the standard tool for acceleration, but representative recipes -- including state-of-the-art end-to-end (E2E) QAT -- lose accuracy on long-decoding reasoning benchmarks despite preserving perplexity and short-decode accuracy. Through a systematic gradient-direction analysis, we identify two factors driving this gap: (i) KV-cache fidelity preservation under the QAT loss, which E2E supervision attenuates via the softmax Fisher metric; and (ii) Hessian-subspace alignment between calibration data and the deployment distribution. We propose LookAhead Quantization (LAQuant), a layer-wise weight-only QAT method that addresses both factors without online-transform overhead by combining reasoning-domain calibration with a one-layer lookahead loss whose implicit cross-layer co-adaptation preserves the next-layer residual stream. For Qwen3-4B under W3G128 quantization, LAQuant improves AIME25 Pass@1 over ParoQuant by 15.11pp (1.93pp over ParoQuant++ at matched calibration) while achieving a 3.42x decoding speedup over FP16 on RTX A6000, compared with ParoQuant's 3.01x.
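As a rough illustration of two terms used in the abstract, the sketch below shows group-wise 3-bit weight quantization with group size 128 (one common reading of "W3G128") and contrasts a plain layer-wise output error with an error measured one layer further downstream. The abstract does not give LAQuant's loss formula, so `lookahead_loss`, the toy `next_layer`, and the round-to-nearest quantizer are assumptions made for illustration only, not the authors' method.

```python
import numpy as np

def fake_quantize_groupwise(w, bits=3, group_size=128):
    """Asymmetric round-to-nearest quantize/dequantize per group of input channels.

    Assumption: a generic weight-only scheme, not necessarily LAQuant's quantizer.
    """
    out_features, in_features = w.shape
    assert in_features % group_size == 0, "in_features must be divisible by group_size"
    g = w.reshape(out_features, in_features // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    levels = 2 ** bits - 1                       # 7 steps -> 8 representable values
    scale = np.maximum(w_max - w_min, 1e-8) / levels
    q = np.clip(np.round((g - w_min) / scale), 0, levels)
    return (q * scale + w_min).reshape(out_features, in_features)

def layer_loss(x, w, w_q):
    """Mean squared error at the output of the quantized layer itself."""
    return float(np.mean((x @ w.T - x @ w_q.T) ** 2))

def lookahead_loss(x, w, w_q, next_layer):
    """Hypothetical one-layer lookahead: error measured after the following layer."""
    return float(np.mean((next_layer(x @ w.T) - next_layer(x @ w_q.T)) ** 2))

# Toy data standing in for calibration activations and two consecutive layers.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 256))
w = rng.standard_normal((256, 256)) * 0.05
w_next = rng.standard_normal((256, 256)) * 0.05

def next_layer(h):
    # Hypothetical stand-in for the next block: nonlinearity plus a linear map.
    return np.tanh(h) @ w_next.T

w_q = fake_quantize_groupwise(w, bits=3, group_size=128)
print("layer-wise loss:", layer_loss(x, w, w_q))
print("lookahead loss: ", lookahead_loss(x, w, w_q, next_layer))
```

In a QAT setting the quantized weights (or their scales) would be optimized against such a loss with gradients; the sketch only evaluates the two objectives once to show what each one measures.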

Euntae Choi, Sumin Song, Sungjoo Yoo • 2026

Related benchmarks

Task	Dataset	Result
Language Modeling	WikiText2	Perplexity 5.8	4085
Language Modeling	C4	Perplexity 6.94	1565
Mathematical Reasoning	AIME 25	Pass@1 Accuracy 61.3	190
General Reasoning	MMLU-Pro	Pass@1 Accuracy 66.86	115
Reasoning	GPQA	Pass@1 57.23	92
Code Generation	LiveCodeBench	Pass@1 56.59	76
General Reasoning	General Reasoning Suite Average	Pass@1 68.62	63
Reasoning	LSAT	Pass@1 85.24	48
Zero-shot Task Evaluation	ARC-C, ARC-E, BoolQ, and HellaSwag	Accuracy 69.35	28
Multi-subject Knowledge Reasoning	MMLU-Pro	Pass@1 71.52	28

Showing 10 of 23 rows
