L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text Compression
About
Learning-based probabilistic models can be combined with an entropy coder for data compression. However, due to the high complexity of learning-based models, their practical application as text compressors has been largely overlooked. To address this issue, our work focuses on a low-complexity design that maintains compression performance. We introduce a novel Learned Lossless Low-complexity Text Compression method (L3TC). First, we conduct extensive experiments demonstrating that RWKV models achieve the fastest decoding speed with a moderate compression ratio, making them the most suitable backbone for our method. Second, we propose an outlier-aware tokenizer that uses a limited vocabulary to cover frequent tokens while allowing outliers to bypass prediction and encoding. Third, we propose a novel high-rank reparameterization strategy that enhances learning capability during training without increasing complexity during inference. Experimental results validate that our method achieves a 48% bit saving compared with the gzip compressor. In addition, L3TC offers compression performance comparable to other learned compressors with a 50x reduction in model parameters. More importantly, L3TC is the fastest among all learned compressors, with real-time decoding speeds of up to megabytes per second. Our code is available at https://github.com/alipay/L3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression.git.
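The opening claim, that a probabilistic model plus an entropy coder yields a compressor, can be made concrete with a toy example. The sketch below is not L3TC: it replaces the RWKV predictor with a hypothetical adaptive order-0 byte model (Laplace-smoothed counts) and, instead of running a real arithmetic coder, sums the ideal code length of -log2 p bits per symbol that such a coder would approach. The function names are illustrative and do not come from the paper or its repository.

```python
import math


def ideal_code_length_bits(data: bytes) -> float:
    """Total ideal code length, in bits, under a toy adaptive byte model.

    Each byte costs -log2 p(byte | bytes seen so far). A decoder can rebuild
    the same counts as it decodes, so no model needs to be transmitted.
    """
    counts = [1] * 256  # Laplace smoothing: every byte value starts at count 1
    total = 256
    bits = 0.0
    for b in data:
        bits += -math.log2(counts[b] / total)
        counts[b] += 1  # update the model only after "coding" the byte
        total += 1
    return bits


def model_bits_per_byte(data: bytes) -> float:
    """Ideal code length divided by the input size in bytes."""
    if not data:
        raise ValueError("data must be non-empty")
    return ideal_code_length_bits(data) / len(data)


if __name__ == "__main__":
    sample = b"the quick brown fox jumps over the lazy dog " * 50
    print(f"toy model: {model_bits_per_byte(sample):.3f} bits per byte")
```

A better predictor assigns higher probability to the byte or token that actually occurs, which shortens the code. That is the role the learned model plays in this family of compressors.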
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Lossless Compression | enwik9 (test) | Bits per Byte | 1.281 | 12 |
| Lossless Compression | Gutenberg (test) | Bits per Byte | 2.237 | 12 |
| Lossless Compression | Spider (test) | Bits per Byte | 2.27 | 12 |
| Lossless Compression | GenoSeq (test) | Bits per Byte | 2.368 | 12 |
| Lossless Compression | WikiSQL (test) | Bits per Byte | 1.712 | 12 |
| Lossless Compression | DNACorpus (test) | Bits per Byte | 2.346 | 12 |
| Data Compression | enwik9 | Adjusted Bits per Byte | 2 | 7 |
| Data Compression | Gutenberg | Adjusted Bits per Byte | 546 | 7 |
| Data Compression | Spider | Adjusted Bits per Byte | 6 | 7 |
| Data Compression | WikiSQL | Adjusted Bits per Byte | 18 | 7 |
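For readers unfamiliar with the metric, bits per byte is the compressed size in bits divided by the original size in bytes, so 8.0 means no compression and lower is better. The sketch below shows one way to measure it for a general-purpose baseline and to express a relative bit saving like the one quoted in the abstract. It assumes Python's standard `zlib` module as a stand-in for gzip (both use DEFLATE, but zlib omits the gzip file header). The helper names are hypothetical, and no figures from the paper are reproduced here.

```python
import zlib


def compressed_bits_per_byte(data: bytes, level: int = 9) -> float:
    """Bits of DEFLATE output per byte of input (8.0 means no compression)."""
    if not data:
        raise ValueError("data must be non-empty")
    return 8 * len(zlib.compress(data, level)) / len(data)


def bit_saving(method_bpb: float, baseline_bpb: float) -> float:
    """Fraction of bits saved by a method relative to a baseline."""
    if baseline_bpb <= 0:
        raise ValueError("baseline_bpb must be positive")
    return 1.0 - method_bpb / baseline_bpb


if __name__ == "__main__":
    sample = b"SELECT name FROM users WHERE id = 42;\n" * 200
    baseline = compressed_bits_per_byte(sample)
    print(f"zlib baseline: {baseline:.3f} bits per byte")
    # Illustrative input only: a method at half the baseline saves 50% of bits.
    print(f"saving at half the baseline: {bit_saving(baseline / 2, baseline):.0%}")
```

The "Adjusted" rows in the table use a different metric whose definition is not given on this page, so the helpers above apply only to the plain bits-per-byte rows.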