No Mean Feat: Simple, Strong Baselines for Context Compression
About
Context compression reduces Transformer inference costs by replacing lengthy inputs with shorter pre-computed representations. It carries significant benefits for retrieval-augmented generation (RAG) and has attracted growing research attention. However, progress remains difficult to measure due to inconsistent evaluations and baselines. We design a standard, easy-to-reproduce evaluation suite for context compression, BenchPress, along with simple, high-performance baselines for English reading comprehension. BenchPress supports benchmarking across model scales, datasets, compression ratios, and short (<1K tokens) to mid-range (<8K tokens) contexts. While the suite is applicable to any compression paradigm, our baselines target soft context compression. We establish two simple baselines that strongly outperform the widely used causal compression-token approach: mean pooling and a bidirectional compression-token variant. Our results show the benefit of bidirectional attention when computing compressed representations, and that simple pooling is an expressive compression operator.
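To make the mean-pooling idea concrete, the sketch below averages consecutive groups of token representations so that a sequence of length n becomes roughly n / ratio vectors. It is an illustration under stated assumptions, not the authors' implementation: the function name `mean_pool_compress`, the use of NumPy, the non-overlapping windows, and the handling of a short final window are all choices made here for the example. The input is assumed to be per-token hidden states that some encoder has already produced.

```python
import numpy as np


def mean_pool_compress(hidden: np.ndarray, ratio: int) -> np.ndarray:
    """Average non-overlapping windows of `ratio` token vectors.

    hidden: array of shape (seq_len, d_model), assumed to come from an encoder.
    ratio:  hypothetical compression ratio (tokens per compressed vector).
    Returns an array of shape (ceil(seq_len / ratio), d_model).
    """
    if ratio < 1:
        raise ValueError("ratio must be a positive integer")

    seq_len, d_model = hidden.shape
    n_chunks = -(-seq_len // ratio)  # ceiling division
    pad = n_chunks * ratio - seq_len

    # Zero-pad the sequence so it splits evenly into windows.
    padded = np.pad(hidden, ((0, pad), (0, 0)))
    # Track which positions are real tokens so padding does not skew the mean.
    mask = np.pad(np.ones(seq_len), (0, pad))

    sums = padded.reshape(n_chunks, ratio, d_model).sum(axis=1)
    counts = mask.reshape(n_chunks, ratio).sum(axis=1, keepdims=True)
    return sums / counts


if __name__ == "__main__":
    # Five toy "token" vectors of width 2, compressed at ratio 2.
    toy = np.arange(10, dtype=float).reshape(5, 2)
    print(mean_pool_compress(toy, ratio=2))
```

In this sketch the last window holds a single token, so its compressed vector equals that token's representation; a real system might instead drop, pad, or merge the remainder.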
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Language Model Evaluation | BenchPress short-context (test) | Accuracy | 65.91 | 131 |
| Context Compression Evaluation | BenchPress suite macro-averaged across all datasets | Macro-averaged F1 | 71.66 | 130 |
| Context Compression | BenchPress short-context (test) | EM (4x Single Context) | 56.41 | 21 |
| Multi-Doc Question Answering | LongBench-E Multi-Doc QA | F1 Score | 45.9 | 17 |
| Single-Doc Question Answering | LongBench-E Single-Doc QA | F1 Score | 39.7 | 17 |