TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly
About
To tackle the huge computational demand of large foundation models, activation-aware compression techniques that require no retraining have been introduced. However, because these methods rely heavily on calibration data, domain-shift issues can arise on unseen downstream tasks. We propose a test-time quantization (TTQ) framework that compresses large models on the fly at inference time to resolve this issue. Through efficient online calibration, instant activation-aware quantization adapts to every prompt regardless of the downstream task, while still achieving inference speedup. Experiments demonstrate that TTQ improves quantization performance over state-of-the-art baselines.
Toshiaki Koike-Akino, Jing Liu, Ye Wang • 2026
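The abstract describes quantization that is calibrated online from the current prompt's activations. The following is a minimal, hedged sketch of that general idea, assuming an AWQ-style per-channel activation scaling folded into the weights before symmetric uniform quantization; it is an illustration under those assumptions, not the paper's exact TTQ algorithm, and all names (`activation_aware_quantize`, `prompt_acts`, etc.) are hypothetical.

```python
import torch

def activation_aware_quantize(weight: torch.Tensor,
                              prompt_acts: torch.Tensor,
                              n_bits: int = 4):
    """Quantize a linear layer's weight using per-channel statistics of the
    current prompt's activations (illustrative sketch, not the paper's method).

    weight:      (out_features, in_features) fp16/fp32 weight matrix
    prompt_acts: (tokens, in_features) activations observed for this prompt
    """
    # Per-input-channel importance estimated from the prompt ("online calibration").
    act_scale = prompt_acts.abs().mean(dim=0).clamp(min=1e-5)      # (in_features,)

    # Fold activation importance into the weight before quantization,
    # so channels with large activations get finer effective resolution.
    s = act_scale.sqrt()
    w_scaled = weight * s                                          # (out, in)

    # Symmetric per-output-channel uniform quantization.
    qmax = 2 ** (n_bits - 1) - 1
    w_absmax = w_scaled.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    step = w_absmax / qmax
    w_int = torch.clamp(torch.round(w_scaled / step), -qmax - 1, qmax)

    # Dequantize and undo the activation scaling for use in the original layer.
    w_deq = w_int * step / s
    return w_int.to(torch.int8), step, s, w_deq


if __name__ == "__main__":
    torch.manual_seed(0)
    W = torch.randn(64, 128)            # toy linear layer weight
    X = torch.randn(16, 128)            # activations from the current prompt
    _, _, _, W_deq = activation_aware_quantize(W, X, n_bits=4)
    err = (X @ W.T - X @ W_deq.T).abs().mean()
    print(f"mean output error after 4-bit prompt-calibrated quantization: {err:.4f}")
```

Because the calibration statistics come from the prompt itself, no offline calibration set is needed, which is what allows the scheme to adapt to unseen downstream tasks at inference time.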
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Goal Achievement | 93.5 | 700 |
| Language Modeling | WT2, PTB, and C4 Macro Average (test) | Perplexity | 13.1 | 192 |
| Runtime Speed | Qwen3 Query Projection Module | Throughput (k tokens/sec) | 92.57 | 90 |
| Inference Throughput | Qwen3 Query Projection Module (NVIDIA A40) | Throughput (k tokens/sec) | 80.63 | 30 |
| Inference Throughput | Qwen3 Models (test) | Throughput (k tokens/sec) | 108 | 30 |