
L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text Compression

About

Learning-based probabilistic models can be combined with an entropy coder for data compression. However, due to the high complexity of learning-based models, their practical application as text compressors has been largely overlooked. To address this issue, our work focuses on a low-complexity design while maintaining compression performance. We introduce a novel Learned Lossless Low-complexity Text Compression method (L3TC). First, we conduct extensive experiments demonstrating that RWKV models achieve the fastest decoding speed with a moderate compression ratio, making RWKV the most suitable backbone for our method. Second, we propose an outlier-aware tokenizer that uses a limited vocabulary to cover frequent tokens while allowing outliers to bypass the prediction and encoding. Third, we propose a novel high-rank reparameterization strategy that enhances the learning capability during training without increasing complexity during inference. Experimental results validate that our method achieves a 48% bit saving compared to the gzip compressor. In addition, L3TC offers compression performance comparable to other learned compressors, with a 50x reduction in model parameters. More importantly, L3TC is the fastest among all learned compressors, with real-time decoding speeds of up to megabytes per second. Our code is available at https://github.com/alipay/L3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression.git.
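
The first sentence of the abstract, pairing a probabilistic model with an entropy coder, can be illustrated with a minimal sketch. It is not the L3TC implementation: a simple adaptive byte-frequency model stands in for the RWKV backbone, and the function names are hypothetical. It relies only on the standard fact that an ideal entropy coder spends about -log2(p) bits on a symbol the model predicted with probability p, so a better predictor yields fewer bits per byte.

```python
import math


def ideal_code_length_bits(data: bytes) -> float:
    """Total ideal code length of `data` under an adaptive order-0 byte model.

    Assumption: this toy frequency model is a stand-in for a learned
    predictor such as RWKV; it is not the model used by L3TC.
    """
    counts = [1] * 256           # Laplace-smoothed count for every byte value
    total = 256
    bits = 0.0
    for b in data:
        p = counts[b] / total    # probability assigned to the byte that occurred
        bits += -math.log2(p)    # cost an ideal entropy coder would pay
        counts[b] += 1           # update after coding, so a decoder can mirror it
        total += 1
    return bits


def bits_per_byte(data: bytes) -> float:
    """Average ideal code length per input byte (0.0 for empty input)."""
    if not data:
        return 0.0
    return ideal_code_length_bits(data) / len(data)


if __name__ == "__main__":
    repetitive = b"a" * 1000
    varied = bytes(range(256)) * 4
    print(f"repetitive: {bits_per_byte(repetitive):.3f} bits/byte")
    print(f"varied:     {bits_per_byte(varied):.3f} bits/byte")
```

A real entropy coder (for example an arithmetic coder) adds a small overhead on top of this ideal length, and a learned model replaces the count table with predictions conditioned on the preceding context.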

Junxuan Zhang, Zhengxue Cheng, Yan Zhao, Shihao Wang, Dajiang Zhou, Guo Lu, Li Song • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Lossless Compression | enwik9 (test) | Bits per Byte | 1.281 | 12 |
| Lossless Compression | Gutenberg (test) | Bits per Byte | 2.237 | 12 |
| Lossless Compression | Spider (test) | bits/Byte | 2.27 | 12 |
| Lossless Compression | GenoSeq (test) | Bits/Byte | 2.368 | 12 |
| Lossless Compression | WikiSQL (test) | Bits/Byte | 1.712 | 12 |
| Lossless Compression | DNACorpus (test) | Bits/Byte | 2.346 | 12 |
| Data Compression | enwik9 | Adjusted Bits/Byte | 2 | 7 |
| Data Compression | Gutenberg | Adjusted bits/Byte | 546 | 7 |
| Data Compression | Spider | Adjusted bits/Byte | 6 | 7 |
| Data Compression | WikiSQL | Adjusted bits/Byte | 18 | 7 |
Showing 10 of 12 rows
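
To read the bits-per-byte figures above, the short sketch below converts one of them into a compression ratio and a compressed-size fraction. It assumes "Bits per Byte" means compressed bits per uncompressed 8-bit byte, the usual meaning of the metric. The "Adjusted Bits/Byte" rows are left out because the page does not define the adjustment. The function names are hypothetical.

```python
def size_fraction(bpb: float) -> float:
    """Compressed size as a fraction of the original (8 bits per input byte)."""
    return bpb / 8.0


def compression_ratio(bpb: float) -> float:
    """Original size divided by compressed size."""
    return 8.0 / bpb


if __name__ == "__main__":
    enwik9_bpb = 1.281  # enwik9 (test) result listed above
    print(f"size fraction:     {size_fraction(enwik9_bpb):.4f}")
    print(f"compression ratio: {compression_ratio(enwik9_bpb):.2f}x")
```

Under that assumption, 1.281 bits per byte corresponds to a compressed file of roughly 16% of the original size, a ratio of about 6.2 to 1.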
