
Context Compression for Auto-regressive Transformers with Sentinel Tokens

About

The quadratic complexity of the attention module gradually makes it the bulk of compute in Transformer-based LLMs during generation. Moreover, the excessive key-value cache that arises when processing long inputs also causes severe problems with memory footprint and inference latency. In this work, we propose a plug-and-play approach that incrementally compresses the intermediate activations of a specified span of tokens into compact ones, thereby reducing both memory and computational cost when processing subsequent context. Experiments on both in-domain language modeling and zero-shot open-ended document generation demonstrate the advantage of our approach over sparse attention baselines in terms of fluency, n-gram matching, and semantic similarity. Finally, we comprehensively profile the benefit of context compression on improving system throughput. Code is available at https://github.com/DRSY/KV_Compression.
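To make the cache-shrinking idea concrete, here is a minimal, illustrative PyTorch sketch of replacing the cached key/value activations of a token span with a few compact slots. The function name compress_kv_span, the mean-pooling used as the compression, and all tensor shapes are assumptions for illustration only; the paper's actual method learns the compression via sentinel tokens (see the linked repository).

```python
import torch


def compress_kv_span(keys, values, start, end, num_compact=1):
    """Illustrative sketch (not the paper's implementation): replace the cached
    key/value activations of tokens in [start, end) with `num_compact` pooled
    vectors, so that subsequent tokens attend to a shorter cache.

    keys, values: [batch, heads, seq_len, head_dim] cached activations.
    """
    # Split the cache into prefix, span-to-compress, and suffix along the sequence axis.
    k_pre, k_span, k_post = keys[:, :, :start], keys[:, :, start:end], keys[:, :, end:]
    v_pre, v_span, v_post = values[:, :, :start], values[:, :, start:end], values[:, :, end:]

    # Pool the span into `num_compact` slots (placeholder for the learned compression).
    k_compact = torch.cat(
        [c.mean(dim=2, keepdim=True) for c in k_span.chunk(num_compact, dim=2)], dim=2
    )
    v_compact = torch.cat(
        [c.mean(dim=2, keepdim=True) for c in v_span.chunk(num_compact, dim=2)], dim=2
    )

    # Reassemble: the span's entries are replaced by the compact ones.
    new_keys = torch.cat([k_pre, k_compact, k_post], dim=2)
    new_values = torch.cat([v_pre, v_compact, v_post], dim=2)
    return new_keys, new_values


if __name__ == "__main__":
    # Toy cache: 1 sequence, 4 heads, 1024 cached tokens, head dim 64.
    k = torch.randn(1, 4, 1024, 64)
    v = torch.randn(1, 4, 1024, 64)
    # Compress the first 512 cached tokens down to 8 compact slots.
    k2, v2 = compress_kv_span(k, v, start=0, end=512, num_compact=8)
    print(k.shape, "->", k2.shape)  # (1, 4, 1024, 64) -> (1, 4, 520, 64)
```

The sketch only demonstrates how the cache length shrinks; memory and attention compute for subsequent tokens scale with the shortened sequence dimension.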

Siyu Ren, Qi Jia, Kenny Q. Zhu • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Stream Generation | StreamChat, sequence length (4096, 4096×2] | BLEU | 18.3 | 8 |
| Stream Generation | StreamChat, sequence length (4096×2, 4096×3] | BLEU | 16.4 | 8 |
| Stream Generation | StreamChat, sequence length (4096×3, 4096×4] | BLEU | 16.9 | 8 |
| Natural language generation | MATH | BLEU | 32.2 | 8 |
| Stream Generation | StreamChat, sequence length (0, 4096] | BLEU | 21.2 | 8 |
| Natural language generation | UltraChat | BLEU | 21.1 | 8 |
| Natural language generation | EverythingLM | BLEU | 18.2 | 8 |
