
LumberChunker: Long-Form Narrative Document Segmentation

About

Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size such that a content's semantic independence is better captured. We propose LumberChunker, a method leveraging an LLM to dynamically segment documents, which iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 "needle in a haystack" type of question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20) but also that, when integrated into a RAG pipeline, LumberChunker proves to be more effective than other chunking methods and competitive baselines, such as the Gemini 1.5M Pro. Our Code and Data are available at https://github.com/joaodsmarques/LumberChunker
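The iterative segmentation loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `lumber_chunk`, the `find_shift` callback (a stand-in for the LLM prompt), the `window_tokens` budget and its default value, and the whitespace-based token count are all assumptions introduced here.

```python
from typing import Callable, List, Sequence


def lumber_chunk(
    paragraphs: Sequence[str],
    find_shift: Callable[[Sequence[str]], int],
    window_tokens: int = 500,  # assumed budget, not taken from the abstract
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[List[str]]:
    """Split paragraphs into variable-size chunks.

    find_shift is a hypothetical stand-in for the LLM call: given a window
    of consecutive paragraphs, it returns the index (within the window) of
    the first paragraph where the content begins to shift, or
    len(window) if it sees no shift.
    """
    chunks: List[List[str]] = []
    start = 0
    while start < len(paragraphs):
        # Grow a window of sequential paragraphs up to the token budget.
        end, used = start, 0
        while end < len(paragraphs):
            cost = count_tokens(paragraphs[end])
            if end > start and used + cost > window_tokens:
                break
            used += cost
            end += 1
        window = paragraphs[start:end]
        if len(window) < 2:
            # Nothing to compare against; emit the paragraph as its own chunk.
            chunks.append(list(window))
            start = end
            continue
        # Clamp the answer so every iteration consumes at least one paragraph.
        offset = max(1, min(int(find_shift(window)), len(window)))
        chunks.append(list(window[:offset]))
        # The next window starts at the paragraph where the shift was found.
        start += offset
    return chunks


def first_letter_shift(window: Sequence[str]) -> int:
    """Toy replacement for the LLM: a 'shift' is a change of first letter."""
    for i, paragraph in enumerate(window):
        if paragraph[0] != window[0][0]:
            return i
    return len(window)


demo = ["a1", "a2", "b1", "b2", "b3", "c1"]
print(lumber_chunk(demo, first_letter_shift, window_tokens=4))
# [['a1', 'a2'], ['b1', 'b2', 'b3'], ['c1']]
```

In practice `find_shift` would wrap a prompt that shows the model the numbered paragraphs of the window and asks for the ID of the first one that departs from the preceding content. Keeping it as a callback lets the loop be tested without any model access.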

André V. Duarte, João Marques, Miguel Graça, Miguel Freire, Lei Li, Arlindo L. Oliveira • 2024

Related benchmarks

Task                Dataset                  Metric      Result   Rank
Document Retrieval  DUDE                     --          --       32
Question Answering  OmniEval                 BLEU Score  19.97    9
Question Answering  HChemSafety              BLEU        22.98    9
Question Answering  CRUD                     BLEU        0.5061   9
Question Answering  WebCPM (test)            ROUGE-L     27.3     7
Question Answering  DuReader (test)          F1 Score    21.78    7
Question Answering  CRUD Single-hop (test)   BLEU-1      0.3456   7
Question Answering  CRUD Two-hop (test)      BLEU-1      22.04    7
Retrieval           CUAD                     Recall      90.31    6
Question Answering  MOAMOB                   ANLS        25.36    6
Showing 10 of 19 rows
