Text-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction
About
Traditional lossless text compression preserves every byte, but its gains on natural language are often modest in realistic operating regimes. We study lossy semantic text compression, where the encoder strategically deletes parts of the text and a large language model (LLM) reconstructs the original content from the retained skeleton. We benchmark a progression of deletion strategies, including uniform step deletion, word-length-guided deletion (WordLen), word-frequency-guided deletion (WordFreq), LP-optimized deletion (Opt), entropy-based deletion using GPT-2 surprisal, and hybrid methods that combine frequency and surprisal signals. Evaluation on the BBC News dataset across retention rates $r_{\text{keep}} \in [0.1, 0.9]$ shows three main findings. First, WordFreq is a strong low-cost baseline: despite using only a static frequency lookup, it remains competitive with much more expensive semantic methods while being far faster at the encoder. Second, semantic and hybrid methods provide their clearest gains at mild-to-moderate compression, whereas word-frequency deletion is often more robust at the lowest retention rates. Third, QLoRA fine-tuning yields a strong local decoder that is competitive with Gemini 2.0 Flash and is often strongest in decoder-only comparisons. Additional English and Chinese experiments show that the overall framework transfers across domains, while the best deletion rule remains dataset-dependent.
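To make the encoder side concrete, the following is a minimal sketch of frequency-guided deletion at a target retention rate. It is not the authors' implementation: the function name `wordfreq_delete`, the whitespace tokenization, the choice to drop the most frequent words first, and the fallback to in-text counts when no static frequency table is supplied are all assumptions made for illustration.

```python
from collections import Counter


def wordfreq_delete(text, r_keep, freq=None):
    """Keep roughly r_keep of the words, dropping the most frequent first.

    Hypothetical helper: `freq` stands in for a static word-frequency
    lookup; if it is omitted, counts from the text itself are used
    (an assumption, not something the abstract specifies).
    """
    words = text.split()  # naive whitespace tokenization (assumption)
    if not words:
        return ""
    if freq is None:
        freq = Counter(w.lower() for w in words)

    # Number of words to retain; always keep at least one.
    n_keep = max(1, round(r_keep * len(words)))

    # Rank positions from rarest to most frequent; ties keep text order.
    order = sorted(
        range(len(words)),
        key=lambda i: (freq.get(words[i].lower(), 0), i),
    )
    keep = set(order[:n_keep])

    # Emit the retained skeleton in the original word order.
    return " ".join(w for i, w in enumerate(words) if i in keep)


if __name__ == "__main__":
    sample = "the cat sat on the mat near the old door"
    print(wordfreq_delete(sample, 0.7))  # cat sat on mat near old door
```

The skeleton produced this way would then be passed to the LLM decoder for reconstruction. The sketch drops deleted words silently; whether the actual method marks deletion positions is not stated in the abstract.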
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text Reconstruction | Zhihu answers | BERTScore F1 99.07 | 65 |
| Text Reconstruction | Chinese Wikipedia n=200 (test) | BERTScore F1 97.63 | 65 |
| Zero-shot text reconstruction | Chinese official news text n=200 chunks (test) | BERTScore F1 98.36 | 65 |
| Text Reconstruction | BBC News | BERTScore F1 98.73 | 25 |
| Text Reconstruction | BBC News (test) | BERTScore F1 99.67 | 15 |
| Text Compression | Reddit conversational text n=200 (test) | BERTScore F1 (r_keep=0.9) 99.46 | 14 |
| Text Compression | Wikipedia Salesforce/wikitext (test) | BERTScore F1 (r_keep=0.9) 0.9862 | 13 |