PISCO: Pretty Simple Compression for Retrieval-Augmented Generation
About
Retrieval-Augmented Generation (RAG) pipelines enhance Large Language Models (LLMs) by retrieving relevant documents, but they face scalability issues due to high inference costs and limited context size. Document compression is a practical solution, yet current soft-compression methods suffer from accuracy losses and require extensive pretraining. In this paper, we introduce PISCO, a novel method that achieves a 16x compression rate with minimal accuracy loss (0-3%) across diverse RAG-based question-answering (QA) tasks. Unlike existing approaches, PISCO requires no pretraining or annotated data, relying solely on sequence-level knowledge distillation from document-based questions. Because a 7-10B LLM can be fine-tuned in 48 hours on a single A100 GPU, PISCO offers a highly efficient and scalable solution. Comprehensive experiments show that PISCO outperforms existing compression models by 8% in accuracy.
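To make the 16x compression rate concrete in context-budget terms, here is a minimal sketch of the prompt-length savings when each retrieved document is replaced by a shorter set of memory embeddings. The chunk size, question length, and number of retrieved documents below are illustrative assumptions, not values taken from the paper:

```python
# Toy illustration: how a 16x soft-compression rate shrinks the prompt
# seen by the generator LLM. All sizes here are assumptions.

def compressed_prompt_length(n_docs, tokens_per_doc, compression_rate, question_tokens):
    """Tokens fed to the generator when each retrieved document is
    replaced by tokens_per_doc / compression_rate memory embeddings."""
    memory_tokens = n_docs * tokens_per_doc // compression_rate
    return memory_tokens + question_tokens

# Assume 10 retrieved chunks of 128 tokens each, plus a 32-token question.
full = 10 * 128 + 32                                # uncompressed prompt: 1312 tokens
comp = compressed_prompt_length(10, 128, 16, 32)    # 10 * 8 + 32 = 112 tokens
print(full, comp)                                   # 1312 112
```

At a 16x rate, the document portion of the prompt shrinks by the same factor, which is where the inference-cost and context-size savings come from.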
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-hop Question Answering | HotpotQA | EM | 0.95 | 18 |
| Question Answering | Natural Questions | EM | 0.06 | 18 |
| Question Answering | TriviaQA | EM | 0.8 | 18 |
| Fact Verification | FactKG | Accuracy | 65.78 | 17 |
| Question Answering | PopQA | EM | 0.44 | 17 |
| Question Answering | General Domain QA (ASQA, HotpotQA, NQ, TriviaQA, POPQA) | ASQA Score | 78 | 12 |
| Inference Efficiency | Natural Questions (NQ) | Time to Last Token (ms) | 502 | 12 |
| Inference Efficiency | HotpotQA | Time to Last Token (ms) | 553 | 12 |
| Question Answering | RobustQA | Bio QA Recall | 29 | 6 |
| Question Answering | Multilingual QA | FR Recall (3-gram) | 60 | 6 |