Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings
About
Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under $m$-dependent sequences, a finite-memory abstraction of short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known boundaries, validating the predicted scaling behavior. Across standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines, and a case study on Taylor Swift's tweets further illustrates its practical applicability. Together, these results show that Embed-KCPD combines theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
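The abstract does not spell out the exact kernel, penalty, or solver used by Embed-KCPD, but the core idea — score every segmentation by within-segment kernel scatter plus a penalty per change point, and minimize exactly — can be sketched in a few lines. The following NumPy sketch is an illustration under stated assumptions, not the paper's implementation: the RBF kernel, the `gamma` heuristic, the `pen` value, and the synthetic "embeddings" are all choices made here for demonstration.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=None):
    """Pairwise RBF kernel K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0, None)
    if gamma is None:
        gamma = 1.0 / X.shape[1]  # simple 1/dim heuristic (an assumption)
    return np.exp(-gamma * d2)

def kernel_cpd(X, pen, gamma=None):
    """Penalized kernel change-point detection by exact dynamic programming.

    Minimizes sum_k cost(segment_k) + pen * (#change points), where a
    segment's cost is its kernel scatter: trace of its kernel block minus
    the block sum divided by the segment length. Returns the sorted list
    of change-point indices (segment start positions, excluding 0).
    """
    n = X.shape[0]
    K = rbf_kernel_matrix(X, gamma)
    # 2-D prefix sums make any block sum K[s:t, s:t] an O(1) lookup.
    S = np.zeros((n + 1, n + 1))
    S[1:, 1:] = np.cumsum(np.cumsum(K, axis=0), axis=1)
    D = np.concatenate([[0.0], np.cumsum(np.diag(K))])

    def seg_cost(s, t):
        block = S[t, t] - S[s, t] - S[t, s] + S[s, s]
        return (D[t] - D[s]) - block / (t - s)

    # F[t] = best penalized cost of segmenting X[:t]; F[0] = -pen so that
    # k segments pay (k - 1) * pen, i.e. pen per change point.
    F = np.full(n + 1, np.inf)
    F[0] = -pen
    prev = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        for s in range(t):
            val = F[s] + seg_cost(s, t) + pen
            if val < F[t]:
                F[t], prev[t] = val, s
    cps, t = [], n
    while t > 0:                 # backtrack the optimal boundaries
        t = prev[t]
        if t > 0:
            cps.append(t)
    return sorted(cps)
```

On synthetic "embeddings" with three well-separated mean levels, for example `np.vstack` of three blocks of 20 Gaussian vectors centered at 0, +3, and -3, `kernel_cpd(X, pen=1.0)` recovers boundaries near indices 20 and 40. The exact O(n²) dynamic program is fine for document-length sequences; faster pruned solvers exist but are beyond this sketch.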
Related benchmarks
| Task | Dataset | Result (Pk) | Rank |
|---|---|---|---|
| Text Segmentation | Choi's Dataset (3-5) | 3.6 | 17 |
| Text Segmentation | Choi's Dataset (6-8) | 2.5 | 17 |
| Text Segmentation | Choi's Dataset (3-11) | 5 | 17 |
| Text Segmentation | Choi's Dataset (9-11) | 3.1 | 17 |
| Text Segmentation | Elements | 32.1 | 15 |
| Text Segmentation | Wiki-50 | 38 | 15 |
| Text Segmentation | Wiki-300 | 32.4 | 14 |
| Text Segmentation | arXiv | 7.9 | 13 |