
SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation

About

LLM inference for enterprise applications, such as summarization, RAG, and code generation, typically involves prompts that are much longer than the generations, leading to high prefill cost and response latency. We present SwiftKV, a novel model transformation and distillation procedure that reduces the prefill compute (in FLOPs) of prompt tokens while preserving high generation quality. First, SwiftKV prefills later layers' KV cache using an earlier layer's output, allowing prompt tokens to skip those later layers. Second, SwiftKV employs a lightweight knowledge-preserving distillation procedure that can adapt existing LLMs with minimal accuracy impact. Third, SwiftKV can naturally incorporate KV cache compression to improve inference performance in low-memory scenarios. Our comprehensive experiments show that SwiftKV can reduce prefill computation by 25-50% across several LLM families while incurring minimal quality degradation. In end-to-end inference serving, SwiftKV realizes up to 2x higher aggregate throughput and 60% lower time per output token. It can achieve 560 TFLOPs/GPU of normalized inference throughput, which translates to 16K tokens/s for Llama-3.1-70B. SwiftKV is open-sourced at https://github.com/snowflakedb/arctictraining.
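The two headline numbers in the abstract (the 25-50% prefill reduction and the 560 TFLOPs/GPU figure) can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is not the authors' accounting. It assumes a Llama-3.1-8B-like layer shape (hidden size 4096, 8 KV heads of dimension 128, MLP width 14336, 32 layers), counts only weight-matrix multiplications, ignores the attention-score term, the embeddings, and the output head, and assumes a skipped layer still runs its K and V projections so its cache can be filled. The names `LayerShape`, `prefill_reduction`, and `tokens_per_second` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LayerShape:
    """Hypothetical description of one Llama-style layer with grouped-query attention."""
    hidden: int        # model width
    kv_dim: int        # num_kv_heads * head_dim
    intermediate: int  # inner width of the gated MLP


def full_layer_params(s: LayerShape) -> int:
    # Q and O projections are hidden x hidden; K and V are hidden x kv_dim;
    # the gated MLP has three hidden x intermediate matrices.
    attn = 2 * s.hidden * s.hidden + 2 * s.hidden * s.kv_dim
    mlp = 3 * s.hidden * s.intermediate
    return attn + mlp


def kv_only_params(s: LayerShape) -> int:
    # Assumption: a skipped layer still applies its K and V projections
    # to the earlier layer's output in order to populate its KV cache.
    return 2 * s.hidden * s.kv_dim


def prefill_reduction(s: LayerShape, num_layers: int, skipped_layers: int) -> float:
    """Fraction of per-token prefill matmul work removed under the assumptions above."""
    full = full_layer_params(s)
    baseline = num_layers * full
    reduced = (num_layers - skipped_layers) * full + skipped_layers * kv_only_params(s)
    return 1.0 - reduced / baseline


def tokens_per_second(tflops: float, params: float) -> float:
    # Rule of thumb: roughly 2 FLOPs per parameter per token in a forward pass.
    return tflops * 1e12 / (2.0 * params)


# Assumed Llama-3.1-8B-like shape; not taken from the abstract.
llama_8b_like = LayerShape(hidden=4096, kv_dim=1024, intermediate=14336)

for skipped in (8, 16):
    print(skipped, round(prefill_reduction(llama_8b_like, 32, skipped), 3))

print(round(tokens_per_second(560, 70e9)))
```

Under these assumptions, skipping a quarter or half of the layers removes about 24% or 48% of the per-token prefill matmul work, which is in line with the 25-50% range quoted above. The same rule of thumb turns 560 TFLOPs into about 4,000 tokens/s per GPU for a 70B-parameter model, so 16K tokens/s would correspond to four GPUs; the GPU count is not stated in the abstract, so treat that as an inference rather than a reported fact.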

Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Prefill KV-cache memory measurement | TULU-3 (dev) | Active KV-cache Memory (GiB) | 0.123 | 32 |
| Stage-aware Prefill | TULU-3 (dev) | Total FLOPs (teraFLOPs) | 13.18 | 32 |
| Prefill | Stage-aware Prefill | TTFT (ms) | 74.36 | 32 |
| Text Generation | Stage-aware Prefill 16K Prompt | TPOT (ms/token) | 25.69 | 4 |
| Text Generation | Stage-aware Prefill 32K Prompt | Latency (ms/token) | 25.59 | 4 |
| Text Generation | Stage-aware Prefill 64K Prompt | TPOT (ms/token) | 30.02 | 4 |
| Text Generation | Stage-aware Prefill 128K Prompt | TPOT (ms/token) | 47.35 | 4 |
| Text Generation | Stage-aware Prefill 1K Prompt | TPOT (ms/token) | 26.38 | 4 |
| Text Generation | Stage-aware Prefill 2K Prompt | TPOT (ms/token) | 25.83 | 4 |
| Text Generation | Stage-aware Prefill 4K Prompt | TPOT (ms/token) | 25.65 | 4 |
Showing 10 of 11 rows
