LumberChunker: Long-Form Narrative Document Segmentation
About
Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size, so that the semantic independence of content is better captured. We propose LumberChunker, a method that leverages an LLM to dynamically segment documents, iteratively prompting the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 "needle in a haystack" question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20), but also that, when integrated into a RAG pipeline, it proves more effective than other chunking methods and competitive baselines, such as the Gemini 1.5M Pro. Our code and data are available at https://github.com/joaodsmarques/LumberChunker
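The iterative segmentation loop described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: `ask_llm` is a hypothetical stand-in for the actual LLM prompt, token counts are approximated by word counts, and `theta` is an illustrative per-prompt token budget.

```python
def lumber_chunk(paragraphs, ask_llm, theta=550):
    """Greedily group sequential paragraphs, asking an LLM where content shifts.

    `ask_llm(group)` is a hypothetical callable that returns the index within
    `group` at which the content begins to shift (an assumption for this sketch).
    """
    chunks = []
    i = 0
    while i < len(paragraphs):
        # Accumulate sequential paragraphs until the token budget is reached.
        group, tokens = [], 0
        j = i
        while j < len(paragraphs) and tokens < theta:
            group.append(paragraphs[j])
            tokens += len(paragraphs[j].split())  # rough token proxy
            j += 1
        # Ask the LLM where, within this window, the content shifts.
        shift = ask_llm(group)
        if shift <= 0 or shift >= len(group):
            shift = len(group)  # fallback: emit the whole window as one chunk
        chunks.append(" ".join(group[:shift]))
        i += shift  # next window starts at the detected shift point
    return chunks
```

Because each window restarts at the detected shift point, chunk boundaries follow content transitions rather than a fixed size, which is the property the abstract argues benefits retrieval.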
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Document Retrieval | DUDE | -- | 32 |
| Question Answering | OmniEval | BLEU 19.97 | 9 |
| Question Answering | HChemSafety | BLEU 22.98 | 9 |
| Question Answering | CRUD | BLEU 0.5061 | 9 |
| Question Answering | WebCPM (test) | ROUGE-L 27.3 | 7 |
| Question Answering | DuReader (test) | F1 21.78 | 7 |
| Question Answering | CRUD Single-hop (test) | BLEU-1 0.3456 | 7 |
| Question Answering | CRUD Two-hop (test) | BLEU-1 22.04 | 7 |
| Retrieval | CUAD | Recall 90.31 | 6 |
| Question Answering | MOAMOB | ANLS 25.36 | 6 |