No Mean Feat: Simple, Strong Baselines for Context Compression

About

Context compression reduces Transformer inference costs by replacing lengthy inputs with shorter pre-computed representations. It carries significant benefits for retrieval-augmented generation (RAG) and has attracted growing research attention. However, progress remains difficult to measure due to inconsistent evaluations and baselines. We design a standard, easy-to-reproduce evaluation suite for context compression, BenchPress, along with simple, high-performance baselines for English reading comprehension. BenchPress supports benchmarking across model scales, datasets, compression ratios, and short (<1K tokens) to mid-range (<8K tokens) contexts. While the suite is applicable to any compression paradigm, our baselines target soft context compression. We establish two simple baselines that strongly outperform the widely used causal compression-token approach: mean pooling and a bidirectional compression-token variant. Our results show the benefit of bidirectional attention when computing compressed representations, and that simple pooling is an expressive compression operator.
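To make the mean-pooling idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `mean_pool`, the use of non-overlapping windows whose size equals the compression ratio, and the handling of a shorter final window are all assumptions made for illustration. The abstract only states that mean pooling serves as the compression operator.

```python
from typing import List

Vector = List[float]


def mean_pool(hidden_states: List[Vector], ratio: int) -> List[Vector]:
    """Average each run of `ratio` consecutive vectors into one vector.

    Assumption: windows do not overlap, and a final window with fewer
    than `ratio` vectors is averaged over the vectors it actually has.
    """
    if ratio < 1:
        raise ValueError("ratio must be a positive integer")
    pooled: List[Vector] = []
    for start in range(0, len(hidden_states), ratio):
        window = hidden_states[start:start + ratio]
        # zip(*window) groups the values of each dimension together.
        pooled.append([sum(dim) / len(window) for dim in zip(*window)])
    return pooled


# Five toy "token representations" of dimension 2, compressed 2x.
states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
compressed = mean_pool(states, ratio=2)
print(compressed)  # [[2.0, 3.0], [6.0, 7.0], [9.0, 10.0]]
```

In a real system the inputs would be the encoder's contextual representations of the document tokens rather than toy lists, and the pooled vectors would stand in for the original context at inference time.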

Yair Feldman, Yoav Artzi • 2025

Related benchmarks

Task	Dataset	Result
Language Model Evaluation	BenchPress short-context (test)	Accuracy: 65.91	131
Context Compression Evaluation	BenchPress suite macro-averaged across all datasets	Macro-averaged F1: 71.66	130
Context Compression	BenchPress short-context (test)	EM (4x Single Context): 56.41	21
Multi-Doc Question Answering	LongBench-E Multi-Doc QA	F1 Score: 45.9	17
Single-Doc Question Answering	LongBench-E Single-Doc QA	F1 Score: 39.7	17
