
Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings

About

Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under $m$-dependent sequences, a finite-memory abstraction of short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known boundaries, validating the predicted scaling behavior. Across standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines, and a case study on Taylor Swift's tweets further illustrates its practical use. Taken together, these results show that Embed-KCPD combines strong theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
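To make the penalized KCPD objective concrete, here is a minimal, self-contained sketch (not the authors' implementation): sentences are represented as embedding vectors, an RBF kernel matrix is formed, and an exact dynamic program minimizes total within-segment kernel scatter plus a penalty per boundary. The `gamma`, `pen`, and toy-embedding values are illustrative assumptions, not from the paper.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Pairwise RBF kernel matrix for the rows of X (one row per sentence).
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def segment_cost(K, s, e):
    # Kernel within-segment scatter for sentences s..e-1:
    # sum_i K_ii - (1/m) * sum_{i,j} K_ij, with m = e - s.
    sub = K[s:e, s:e]
    return np.trace(sub) - sub.sum() / (e - s)

def kcpd(K, pen):
    # Exact DP minimizing total segment cost + pen * (number of boundaries).
    n = K.shape[0]
    best = np.full(n + 1, np.inf)
    best[0] = -pen  # offsets the penalty charged for the first segment
    back = np.zeros(n + 1, dtype=int)
    for e in range(1, n + 1):
        for s in range(e):
            val = best[s] + segment_cost(K, s, e) + pen
            if val < best[e]:
                best[e], back[e] = val, s
    # Backtrack to recover segment end points.
    bounds, e = [], n
    while e > 0:
        bounds.append(e)
        e = back[e]
    return sorted(bounds)[:-1]  # interior boundaries only

# Toy "embeddings": two topics, simulated as a mean shift at sentence 10.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 4)),
               rng.normal(2, 0.1, (10, 4))])
print(kcpd(rbf_kernel(X), pen=0.5))  # boundary recovered at index 10
```

Larger `pen` yields fewer, longer segments; in practice the kernel bandwidth and penalty would be set per the paper's theory rather than these toy values.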

Mumin Jia, Jairo Diaz-Rodriguez • 2026

Related benchmarks

| Task | Dataset | Metric | Score | Rank |
| --- | --- | --- | --- | --- |
| Text Segmentation | Choi's Dataset (3-5) | Pk | 3.6 | 17 |
| Text Segmentation | Choi's Dataset (6-8) | Pk | 2.5 | 17 |
| Text Segmentation | Choi's Dataset (3-11) | Pk | 5 | 17 |
| Text Segmentation | Choi's Dataset (9-11) | Pk | 3.1 | 17 |
| Text Segmentation | Elements | Pk | 32.1 | 15 |
| Text Segmentation | Wiki-50 | Pk | 38 | 15 |
| Text Segmentation | Wiki-300 | Pk | 32.4 | 14 |
| Text Segmentation | arXiv | Pk | 7.9 | 13 |
