
TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly

About

To tackle the huge computational demands of large foundation models, activation-aware compression techniques that require no retraining have been introduced. However, because these methods rely heavily on calibration data, domain-shift issues may arise on unseen downstream tasks. We propose a test-time quantization (TTQ) framework that compresses large models on the fly at inference time to resolve this issue. With efficient online calibration, instant activation-aware quantization adapts to every prompt regardless of the downstream task, while still achieving inference speedup. Experiments demonstrate that TTQ improves quantization performance over state-of-the-art baselines.

Toshiaki Koike-Akino, Jing Liu, Ye Wang • 2026
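
This page only summarizes the method, so the paper's exact procedure is not shown here. As a rough illustration of what online, prompt-calibrated activation-aware quantization could look like, the sketch below scales weight columns by per-channel activation statistics gathered from the current prompt before group-wise quantization, loosely following AWQ-style scaling. The function name ttq_quantize_layer and all parameter choices are assumptions, not the authors' implementation.

```python
import torch

def ttq_quantize_layer(weight: torch.Tensor,
                       prompt_activations: torch.Tensor,
                       n_bits: int = 4,
                       group_size: int = 128) -> torch.Tensor:
    """Sketch of activation-aware weight quantization calibrated on the current prompt.

    weight:             [out_features, in_features] linear-layer weight
    prompt_activations: [num_tokens, in_features] activations collected from the
                        incoming prompt at inference time (online calibration)
    Returns a dequantized weight tensor for inspection; a real kernel would keep
    the packed integer representation for speed.
    """
    out_f, in_f = weight.shape
    assert in_f % group_size == 0, "in_features must be divisible by group_size"

    # Online calibration: per-input-channel activation magnitude from the prompt.
    act_scale = prompt_activations.abs().mean(dim=0).clamp(min=1e-5)  # [in_features]

    # Activation-aware scaling: protect input channels with large activations by
    # moving their quantization difficulty into the weights before rounding.
    s = act_scale.sqrt()
    w_scaled = weight * s

    # Group-wise symmetric quantization of the scaled weights.
    w_groups = w_scaled.reshape(out_f, in_f // group_size, group_size)
    q_max = 2 ** (n_bits - 1) - 1
    w_scale = w_groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / q_max
    w_int = torch.clamp(torch.round(w_groups / w_scale), -q_max - 1, q_max)

    # Dequantize and undo the activation-aware scaling.
    return (w_int * w_scale).reshape(out_f, in_f) / s
```

The key point of the sketch is that the calibration statistics come from the prompt actually being served rather than from a fixed offline calibration set, which is what would let the quantization adapt to unseen downstream tasks at inference time.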

Related benchmarks

Task                  Dataset                                      Metric                      Result   Rank
Robot Manipulation    LIBERO                                       Goal Achievement            93.5     700
Language Modeling     WT2, PTB, and C4 Macro Average (test)        Perplexity                  13.1     192
Runtime Speed         Qwen3 Query Projection Module                Throughput (k tokens/sec)   92.57    90
Inference Throughput  Qwen3 Query Projection Module, NVIDIA A40    Throughput (k tokens/sec)   80.63    30
Inference Throughput  Qwen3 Models (test)                          Throughput (k tokens/sec)   108      30
