Context Embeddings for Efficient Answer Generation in RAG

About

Retrieval-Augmented Generation (RAG) overcomes the limited knowledge of LLMs by extending the input with external information. As a consequence, the contextual inputs to the model become much longer, which slows down decoding and directly increases the time a user has to wait for an answer. We address this challenge by presenting COCOM, an effective context compression method that reduces long contexts to only a handful of Context Embeddings, speeding up generation by a large margin. Our method allows for different compression rates, trading off decoding time for answer quality. Compared to earlier methods, COCOM handles multiple contexts more effectively, significantly reducing decoding time for long inputs. Our method demonstrates a speed-up of up to 5.69× while achieving higher performance than existing efficient context compression methods.

David Rau, Shuai Wang, Hervé Déjean, Stéphane Clinchant • 2024
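
The trade-off between compression rate and input length described in the abstract can be illustrated with a little arithmetic. The following is a minimal sketch, not the authors' implementation: it assumes that a compression rate of r maps every r context tokens to one Context Embedding (rounded up), and the function names `num_context_embeddings` and `compressed_input_length` are hypothetical, introduced only for illustration.

```python
import math


def num_context_embeddings(context_tokens: int, compression_rate: int) -> int:
    """Number of embeddings left after compressing one context.

    Assumption (not stated in the abstract): a rate of r maps every
    r context tokens to a single embedding, rounding up.
    """
    if context_tokens < 0 or compression_rate < 1:
        raise ValueError("context_tokens must be >= 0 and compression_rate >= 1")
    return math.ceil(context_tokens / compression_rate)


def compressed_input_length(question_tokens: int,
                            context_lengths: list[int],
                            compression_rate: int) -> int:
    """Total input positions the generator sees: the uncompressed question
    plus the embeddings of each independently compressed context."""
    return question_tokens + sum(
        num_context_embeddings(n, compression_rate) for n in context_lengths
    )


if __name__ == "__main__":
    contexts = [128] * 5  # five retrieved passages of 128 tokens each (made-up sizes)
    for rate in (1, 4, 16, 128):
        print(rate, compressed_input_length(20, contexts, rate))
```

Under these assumed sizes, a 20-token question with five 128-token passages occupies 660 input positions uncompressed (rate 1) and 60 positions at rate 16. The sketch only counts input positions; it says nothing about the resulting decoding time or answer quality, which the paper measures empirically.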

Related benchmarks

Task	Dataset	Result
Inference Efficiency	Natural Questions (NQ)	--	90
Long-context Reasoning	LongBench v2	Average Score 27.24	88
Long-context Reasoning	Locomo	--	75
Question Answering	PopQA	EM 21.72	27
Question Answering	Natural Questions	EM 31.86	18
Question Answering	TriviaQA	EM 13.75	18
Multi-hop Question Answering	HotpotQA	EM 25.9	18
Fact Verification	FactKG	Accuracy 60.87	17
Long-context Question Answering	L-Eval QA	NQ 61.47	13
Long-context Reasoning	BAMBOO 16k	AltQA Score 30.5	13

Showing 10 of 13 rows
