Diversity, Density, and Homogeneity: Quantitative Characteristic Metrics for Text Collections

About

Summarizing data samples by quantitative measures has a long history, with descriptive statistics being a case in point. However, as natural language processing methods flourish, there are still insufficient characteristic metrics to describe a collection of texts in terms of the words, sentences, or paragraphs they comprise. In this work, we propose metrics of diversity, density, and homogeneity that quantitatively measure the dispersion, sparsity, and uniformity of a text collection. We conduct a series of simulations to verify that each metric holds desired properties and resonates with human intuitions. Experiments on real-world datasets demonstrate that the proposed characteristic metrics are highly correlated with text classification performance of a renowned model, BERT, which could inspire future applications.

Yi-An Lai, Xuan Zhu, Yi Zhang, Mona Diab • 2020

Related benchmarks

Task	Dataset	Result
Classification	BBC	Accuracy 66	61
Diversity measurement correlation	Instruction-tuning (IT) datasets LLaMA-3-8B performance	Pearson 0.87	12
Diversity measurement correlation	Instruction-tuning (IT) datasets Qwen-2.5-7B performance	Average Correlation 0.48	12
Classification	patents	Accuracy 66	8
Classification	cnn	Accuracy 66	8
Classification	arXiv	Accuracy 66	8
