
PISCO: Pretty Simple Compression for Retrieval-Augmented Generation

About

Retrieval-Augmented Generation (RAG) pipelines enhance Large Language Models (LLMs) by retrieving relevant documents, but they face scalability issues due to high inference costs and limited context size. Document compression is a practical solution, but current soft compression methods suffer from accuracy losses and require extensive pretraining. In this paper, we introduce PISCO, a novel method that achieves a 16x compression rate with minimal accuracy loss (0-3%) across diverse RAG-based question-answering (QA) tasks. Unlike existing approaches, PISCO requires no pretraining or annotated data, relying solely on sequence-level knowledge distillation from document-based questions. With the ability to fine-tune a 7-10B LLM in 48 hours on a single A100 GPU, PISCO offers a highly efficient and scalable solution. We present comprehensive experiments showing that PISCO outperforms existing compression models by 8% in accuracy.
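The abstract's core idea — training the student purely by sequence-level knowledge distillation, i.e. cross-entropy on the token sequence a teacher generated from the uncompressed documents — can be sketched roughly as follows. This is a minimal illustration with hypothetical shapes and names, not PISCO's actual training code:

```python
import math

def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_kd_loss(student_logits, teacher_tokens):
    """Sequence-level KD: mean cross-entropy of the student's next-token
    distributions against the tokens the teacher generated.

    student_logits: list of per-position logit vectors (length seq_len,
                    each of size vocab) from the student reading the
                    *compressed* documents (hypothetical shapes).
    teacher_tokens: list of seq_len token ids the teacher produced from
                    the full, uncompressed documents.
    """
    total = 0.0
    for logits, tok in zip(student_logits, teacher_tokens):
        probs = softmax(logits)
        total += -math.log(probs[tok])
    return total / len(teacher_tokens)

# Toy check: a student that is maximally unsure (uniform logits over a
# vocab of 5) incurs a loss of log(5) per teacher token.
uniform = [[0.0] * 5 for _ in range(3)]
loss = sequence_kd_loss(uniform, [1, 4, 2])  # ≈ log(5) ≈ 1.609
```

Because the targets are teacher-generated rather than human-annotated, this objective needs only documents and questions, which is what lets the method skip pretraining and labeled data.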

Maxime Louis, Hervé Déjean, Stéphane Clinchant • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multi-hop Question Answering | HotpotQA | EM 0.95 | 18 |
| Question Answering | Natural Questions | EM 0.06 | 18 |
| Question Answering | TriviaQA | EM 0.8 | 18 |
| Fact Verification | FactKG | Accuracy 65.78 | 17 |
| Question Answering | PopQA | EM 0.44 | 17 |
| Question Answering | General Domain QA (ASQA, HotpotQA, NQ, TriviaQA, POPQA) | ASQA Score 78 | 12 |
| Inference Efficiency | Natural Questions (NQ) | Time to Last Token (ms) 502 | 12 |
| Inference Efficiency | HotpotQA | Time to Last Token (ms) 553 | 12 |
| Question Answering | RobustQA | Bio QA Recall 29 | 6 |
| Question Answering | Multilingual QA | FR Recall (3-gram) 60 | 6 |

Other info

Code
