Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings
About
Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under $m$-dependent sequences, a finite-memory abstraction of short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known boundaries, validating the predicted scaling behavior. Across standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines, and a case study on Taylor Swift's tweets further illustrates its practical applicability. Together, these results show that Embed-KCPD combines theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
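The abstract does not spell out the exact kernel, penalty, or solver used by Embed-KCPD, but the core idea — score every segmentation by within-segment kernel scatter plus a penalty per change point, and minimize exactly — can be sketched in a few lines. The following NumPy sketch is an illustration under stated assumptions, not the paper's implementation: the RBF kernel, the `gamma` heuristic, the `pen` value, and the synthetic "embeddings" are all choices made here for demonstration.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=None):
    """Pairwise RBF kernel K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0, None)
    if gamma is None:
        gamma = 1.0 / X.shape[1]  # simple 1/dim heuristic (an assumption)
    return np.exp(-gamma * d2)

def kernel_cpd(X, pen, gamma=None):
    """Penalized kernel change-point detection by exact dynamic programming.

    Minimizes sum_k cost(segment_k) + pen * (#change points), where a
    segment's cost is its kernel scatter: trace of its kernel block minus
    the block sum divided by the segment length. Returns the sorted list
    of change-point indices (segment start positions, excluding 0).
    """
    n = X.shape[0]
    K = rbf_kernel_matrix(X, gamma)
    # 2-D prefix sums make any block sum K[s:t, s:t] an O(1) lookup.
    S = np.zeros((n + 1, n + 1))
    S[1:, 1:] = np.cumsum(np.cumsum(K, axis=0), axis=1)
    D = np.concatenate([[0.0], np.cumsum(np.diag(K))])

    def seg_cost(s, t):
        block = S[t, t] - S[s, t] - S[t, s] + S[s, s]
        return (D[t] - D[s]) - block / (t - s)

    # F[t] = best penalized cost of segmenting X[:t]; F[0] = -pen so that
    # k segments pay (k - 1) * pen, i.e. pen per change point.
    F = np.full(n + 1, np.inf)
    F[0] = -pen
    prev = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        for s in range(t):
            val = F[s] + seg_cost(s, t) + pen
            if val < F[t]:
                F[t], prev[t] = val, s
    cps, t = [], n
    while t > 0:                 # backtrack the optimal boundaries
        t = prev[t]
        if t > 0:
            cps.append(t)
    return sorted(cps)
```

On synthetic "embeddings" with three well-separated mean levels, for example `np.vstack` of three blocks of 20 Gaussian vectors centered at 0, +3, and -3, `kernel_cpd(X, pen=1.0)` recovers boundaries near indices 20 and 40. The exact O(n²) dynamic program is fine for document-length sequences; faster pruned solvers exist but are beyond this sketch.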
Related benchmarks
| Task | Dataset | Result (Pk) | Rank |
|---|---|---|---|
| Text Segmentation | Choi's Dataset (3-5) | 3.6 | 17 |
| Text Segmentation | Choi's Dataset (6-8) | 2.5 | 17 |
| Text Segmentation | Choi's Dataset (3-11) | 5 | 17 |
| Text Segmentation | Choi's Dataset (9-11) | 3.1 | 17 |
| Text Segmentation | Elements | 32.1 | 15 |
| Text Segmentation | Wiki-50 | 38 | 15 |
| Text Segmentation | Wiki-300 | 32.4 | 14 |
| Text Segmentation | arXiv | 7.9 | 13 |