Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth

About

Estimating the frequency of items on the high-volume, fast data stream has been extensively studied in many areas, such as database and network measurement. Traditional sketches provide only coarse estimates under strict memory constraints. Although some learning-augmented methods have emerged recently, they typically rely on offline training with real frequencies or/and labels, which are often unavailable. Moreover, these methods suffer from slow update speeds, limiting their suitability for real-time processing despite offering only marginal accuracy improvements. To overcome these challenges, we propose UCL-sketch, a practical learning-based paradigm for per-key frequency estimation. Our design introduces two key innovations: (i) an online training mechanism based on equivalent learning that requires no ground truth (GT), and (ii) a highly scalable architecture leveraging logically structured estimation buckets to scale to real-world data stream. The UCL-sketch, which utilizes compressive sensing (CS), converges to an estimator that provably yields a error bound far lower than that of prior works, without sacrificing the speed of processing. Extensive experiments on both real-world and synthetic datasets demonstrate that our approach outperforms previously proposed approaches regarding per-key accuracy and distribution. Notably, under extremely tight memory budgets, its quality almost matches that of an (infeasible) omniscient oracle. Moreover, compared to the existing equation-based sketch, UCL-sketch achieves an average decoding speedup of nearly 500 times. To help further research and development, our code is publicly available at https://github.com/Y-debug-sys/UCL-sketch.

Xinyu Yuan, Yan Qiao, Meng Li, Zhenchun Wei, Cuiying Feng, Zonghui Wang, Wenzhi Chen• 2024

Related benchmarks

TaskDatasetResultRank
Frequency EstimationCAIDA streaming trace
Entropy Absolute Error1.42
64
Frequency EstimationKosarak
Entropy Absolute Error10.18
64
Frequency EstimationRetail
Entropy Absolute Error10.56
64
Showing 3 of 3 rows

Other info

Follow for update