PISCO: Pretty Simple Compression for Retrieval-Augmented Generation
About
Retrieval-Augmented Generation (RAG) pipelines enhance Large Language Models (LLMs) by retrieving relevant documents, but they face scalability issues due to high inference costs and limited context size. Document compression is a practical solution, yet current soft-compression methods suffer from accuracy losses and require extensive pretraining. In this paper, we introduce PISCO, a novel method that achieves a 16x compression rate with minimal accuracy loss (0-3%) across diverse RAG-based question-answering (QA) tasks. Unlike existing approaches, PISCO requires no pretraining or annotated data, relying solely on sequence-level knowledge distillation from document-based questions. Because a 7-10B LLM can be fine-tuned in 48 hours on a single A100 GPU, PISCO offers a highly efficient and scalable solution. Comprehensive experiments show that PISCO outperforms existing compression models by 8% in accuracy.
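To make the 16x compression rate concrete in context-budget terms, here is a minimal sketch of the prompt-length savings when each retrieved document is replaced by a shorter set of memory embeddings. The chunk size, question length, and number of retrieved documents below are illustrative assumptions, not values taken from the paper:

```python
# Toy illustration: how a 16x soft-compression rate shrinks the prompt
# seen by the generator LLM. All sizes here are assumptions.

def compressed_prompt_length(n_docs, tokens_per_doc, compression_rate, question_tokens):
    """Tokens fed to the generator when each retrieved document is
    replaced by tokens_per_doc / compression_rate memory embeddings."""
    memory_tokens = n_docs * tokens_per_doc // compression_rate
    return memory_tokens + question_tokens

# Assume 10 retrieved chunks of 128 tokens each, plus a 32-token question.
full = 10 * 128 + 32                                # uncompressed prompt: 1312 tokens
comp = compressed_prompt_length(10, 128, 16, 32)    # 10 * 8 + 32 = 112 tokens
print(full, comp)                                   # 1312 112
```

At a 16x rate, the document portion of the prompt shrinks by the same factor, which is where the inference-cost and context-size savings come from.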
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-hop Question Answering | HotpotQA | EM | 0.95 | 18 |
| Question Answering | Natural Questions | EM | 0.06 | 18 |
| Question Answering | TriviaQA | EM | 0.8 | 18 |
| Fact Verification | FactKG | Accuracy | 65.78 | 17 |
| Question Answering | PopQA | EM | 0.44 | 17 |
| Question Answering | General Domain QA (ASQA, HotpotQA, NQ, TriviaQA, POPQA) | ASQA Score | 78 | 12 |
| Inference Efficiency | Natural Questions (NQ) | Time to Last Token (ms) | 502 | 12 |
| Inference Efficiency | HotpotQA | Time to Last Token (ms) | 553 | 12 |
| Question Answering | RobustQA | Bio QA Recall | 29 | 6 |
| Question Answering | Multilingual QA | FR Recall (3-gram) | 60 | 6 |