Text Segmentation as a Supervised Learning Task

About

Text segmentation, the task of dividing a document into contiguous segments based on its semantic structure, is a longstanding challenge in language understanding. Previous work on text segmentation focused on unsupervised methods such as clustering or graph search, due to the paucity of labeled data. In this work, we formulate text segmentation as a supervised learning problem and present a large new dataset for text segmentation that is automatically extracted and labeled from Wikipedia. Moreover, we develop a segmentation model based on this dataset and show that it generalizes well to unseen natural text.
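
To make the supervised formulation concrete, here is a minimal sketch of how a document that is already divided into sections could be turned into per-sentence training labels. The function name `sections_to_labels` and the labeling convention (1 marks a sentence that ends a segment) are assumptions made for illustration, not necessarily the exact scheme used by the authors; the sketch also assumes the sections have already been split into sentences.

```python
from typing import List, Tuple


def sections_to_labels(sections: List[List[str]]) -> Tuple[List[str], List[int]]:
    """Flatten sections into sentences plus binary boundary labels.

    Assumed convention (illustrative only): label 1 means the sentence
    is the last one in its section, label 0 means it is not.
    """
    sentences: List[str] = []
    labels: List[int] = []
    for section in sections:
        for i, sentence in enumerate(section):
            sentences.append(sentence)
            # The final sentence of each section closes a segment.
            labels.append(1 if i == len(section) - 1 else 0)
    return sentences, labels


if __name__ == "__main__":
    doc = [
        ["Cats are mammals.", "They are kept as pets."],
        ["Dogs were domesticated early."],
    ]
    sents, labs = sections_to_labels(doc)
    for sentence, label in zip(sents, labs):
        print(label, sentence)
```

With labels in this form, segmentation reduces to a binary prediction for each sentence, which is the kind of target a supervised model can be trained on.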

Omri Koshorek, Adir Cohen, Noam Mor, Michael Rotman, Jonathan Berant • 2018

Related benchmarks

Task	Dataset	Result
Dialogue Segmentation	DialSeg711	Pk 0.63	44
Dialogue Segmentation	TIAGE	Pk 0.393	39
Dialogue Topic Segmentation	Doc2Dial	Pk 28.5	34
Dialogue Topic Segmentation	SuperSeg	Pk Score 24.1	28
Dialogue Topic Segmentation	VHF	Pk Score 8.2	25
Text Segmentation	Wiki-50	Pk 18.2	15
Text Segmentation	Elements	Pk 41.6	15
Text Segmentation	WIKI-727K (test)	Precision 69.3	4
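
Most rows above report Pk, a window-based segmentation error for which lower is better. The sketch below follows the commonly used definition: slide a window of size k over the sentences and count how often the reference and the hypothesis disagree on whether the two ends of the window fall in the same segment. The function names, the input convention (1 marks a sentence that ends a segment), and the default of k as half the mean reference segment length are assumptions for illustration; individual benchmarks may compute or scale the metric differently.

```python
from typing import List, Optional


def segment_ids(boundaries: List[int]) -> List[int]:
    """Map boundary flags (1 = sentence ends a segment) to segment ids."""
    ids: List[int] = []
    current = 0
    for flag in boundaries:
        ids.append(current)
        if flag:
            current += 1
    return ids


def pk(reference: List[int], hypothesis: List[int], k: Optional[int] = None) -> float:
    """Fraction of windows where reference and hypothesis disagree."""
    if not reference or len(reference) != len(hypothesis):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)
    ref_ids = segment_ids(reference)
    hyp_ids = segment_ids(hypothesis)
    if k is None:
        # Assumed default: half the mean reference segment length.
        num_segments = ref_ids[-1] + 1
        k = max(1, round(n / (2 * num_segments)))
    if k >= n:
        raise ValueError("window size k must be smaller than the sequence")
    errors = 0
    for i in range(n - k):
        same_in_ref = ref_ids[i] == ref_ids[i + k]
        same_in_hyp = hyp_ids[i] == hyp_ids[i + k]
        if same_in_ref != same_in_hyp:
            errors += 1
    return errors / (n - k)


if __name__ == "__main__":
    reference = [0, 0, 0, 1, 0, 0, 0, 1]   # two segments of four sentences
    hypothesis = [0, 0, 0, 0, 0, 0, 0, 1]  # one segment, boundary missed
    print(pk(reference, hypothesis))
```

A perfect hypothesis scores 0.0 under this definition; the result is a fraction in [0, 1] and would need to be multiplied by 100 to be read as a percentage.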

