
Diversity, Density, and Homogeneity: Quantitative Characteristic Metrics for Text Collections

About

Summarizing data samples by quantitative measures has a long history, with descriptive statistics being a case in point. However, as natural language processing methods flourish, there are still insufficient characteristic metrics to describe a collection of texts in terms of the words, sentences, or paragraphs they comprise. In this work, we propose metrics of diversity, density, and homogeneity that quantitatively measure the dispersion, sparsity, and uniformity of a text collection. We conduct a series of simulations to verify that each metric holds desired properties and resonates with human intuitions. Experiments on real-world datasets demonstrate that the proposed characteristic metrics are highly correlated with text classification performance of a renowned model, BERT, which could inspire future applications.
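As a rough illustration of what a dispersion-style metric over a text collection might look like, the sketch below scores a set of embedding vectors by the geometric mean of their per-dimension standard deviations. This is only an assumed proxy for illustration. The function name `dispersion`, the `eps` parameter, and the random vectors standing in for text embeddings are all hypothetical, and the definitions actually used in the paper may differ.

```python
import numpy as np

def dispersion(embeddings: np.ndarray, eps: float = 1e-12) -> float:
    """Geometric mean of per-dimension standard deviations (illustrative proxy only)."""
    sigma = embeddings.std(axis=0)  # spread along each embedding dimension
    # Average in log space to get the geometric mean; eps guards against log(0).
    return float(np.exp(np.mean(np.log(sigma + eps))))

# Random vectors stand in for text embeddings (an assumption, not real data).
rng = np.random.default_rng(0)
tight = rng.normal(scale=0.1, size=(200, 8))  # stand-in for near-duplicate texts
wide = rng.normal(scale=1.0, size=(200, 8))   # stand-in for varied texts

print(dispersion(tight) < dispersion(wide))
```

Under this proxy, a collection whose embeddings spread further apart scores higher, which matches the intuition that a more diverse collection is more dispersed.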

Yi-An Lai, Xuan Zhu, Yi Zhang, Mona Diab • 2020

Related benchmarks

Task | Dataset | Result | Rank
Classification | BBC | Accuracy: 66 | 20
Diversity measurement correlation | Instruction-tuning (IT) datasets, LLaMA-3-8B performance | Pearson: 0.87 | 12
Diversity measurement correlation | Instruction-tuning (IT) datasets, Qwen-2.5-7B performance | Average Correlation: 0.48 | 12
Classification | patents | Accuracy: 66 | 8
Classification | cnn | Accuracy: 66 | 8
Classification | arXiv | Accuracy: 66 | 8
