TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly
About
To tackle the huge computational demand of large foundation models, activation-aware compression techniques that require no retraining have been introduced. However, because these methods rely heavily on calibration data, domain-shift issues can arise on unseen downstream tasks. We propose a test-time quantization (TTQ) framework that compresses large models on the fly at inference time to resolve this issue. Through efficient online calibration, instant activation-aware quantization adapts to every prompt regardless of the downstream task, while still achieving inference speedup. Experiments demonstrate that TTQ improves quantization performance over state-of-the-art baselines.
Toshiaki Koike-Akino, Jing Liu, Ye Wang • 2026
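The abstract describes quantization that is calibrated online from the current prompt's activations. The following is a minimal, hedged sketch of that general idea, assuming an AWQ-style per-channel activation scaling folded into the weights before symmetric uniform quantization; it is an illustration under those assumptions, not the paper's exact TTQ algorithm, and all names (`activation_aware_quantize`, `prompt_acts`, etc.) are hypothetical.

```python
import torch

def activation_aware_quantize(weight: torch.Tensor,
                              prompt_acts: torch.Tensor,
                              n_bits: int = 4):
    """Quantize a linear layer's weight using per-channel statistics of the
    current prompt's activations (illustrative sketch, not the paper's method).

    weight:      (out_features, in_features) fp16/fp32 weight matrix
    prompt_acts: (tokens, in_features) activations observed for this prompt
    """
    # Per-input-channel importance estimated from the prompt ("online calibration").
    act_scale = prompt_acts.abs().mean(dim=0).clamp(min=1e-5)      # (in_features,)

    # Fold activation importance into the weight before quantization,
    # so channels with large activations get finer effective resolution.
    s = act_scale.sqrt()
    w_scaled = weight * s                                          # (out, in)

    # Symmetric per-output-channel uniform quantization.
    qmax = 2 ** (n_bits - 1) - 1
    w_absmax = w_scaled.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    step = w_absmax / qmax
    w_int = torch.clamp(torch.round(w_scaled / step), -qmax - 1, qmax)

    # Dequantize and undo the activation scaling for use in the original layer.
    w_deq = w_int * step / s
    return w_int.to(torch.int8), step, s, w_deq


if __name__ == "__main__":
    torch.manual_seed(0)
    W = torch.randn(64, 128)            # toy linear layer weight
    X = torch.randn(16, 128)            # activations from the current prompt
    _, _, _, W_deq = activation_aware_quantize(W, X, n_bits=4)
    err = (X @ W.T - X @ W_deq.T).abs().mean()
    print(f"mean output error after 4-bit prompt-calibrated quantization: {err:.4f}")
```

Because the calibration statistics come from the prompt itself, no offline calibration set is needed, which is what allows the scheme to adapt to unseen downstream tasks at inference time.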
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Goal Achievement | 93.5 | 700 |
| Language Modeling | WT2, PTB, and C4 Macro Average (test) | Perplexity | 13.1 | 192 |
| Runtime Speed | Qwen3 Query Projection Module | Throughput (k tokens/sec) | 92.57 | 90 |
| Inference Throughput | Qwen3 Query Projection Module (NVIDIA A40) | Throughput (k tokens/sec) | 80.63 | 30 |
| Inference Throughput | Qwen3 Models (test) | Throughput (k tokens/sec) | 108 | 30 |