LumberChunker: Long-Form Narrative Document Segmentation
About
Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size, so that the semantic independence of content is better captured. We propose LumberChunker, a method that leverages an LLM to dynamically segment documents, iteratively prompting the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 "needle in a haystack" question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20), but also that, when integrated into a RAG pipeline, it proves more effective than other chunking methods and competitive baselines, such as the Gemini 1.5M Pro. Our code and data are available at https://github.com/joaodsmarques/LumberChunker
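The iterative segmentation loop described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: `ask_llm` is a hypothetical stand-in for the actual LLM prompt, token counts are approximated by word counts, and `theta` is an illustrative per-prompt token budget.

```python
def lumber_chunk(paragraphs, ask_llm, theta=550):
    """Greedily group sequential paragraphs, asking an LLM where content shifts.

    `ask_llm(group)` is a hypothetical callable that returns the index within
    `group` at which the content begins to shift (an assumption for this sketch).
    """
    chunks = []
    i = 0
    while i < len(paragraphs):
        # Accumulate sequential paragraphs until the token budget is reached.
        group, tokens = [], 0
        j = i
        while j < len(paragraphs) and tokens < theta:
            group.append(paragraphs[j])
            tokens += len(paragraphs[j].split())  # rough token proxy
            j += 1
        # Ask the LLM where, within this window, the content shifts.
        shift = ask_llm(group)
        if shift <= 0 or shift >= len(group):
            shift = len(group)  # fallback: emit the whole window as one chunk
        chunks.append(" ".join(group[:shift]))
        i += shift  # next window starts at the detected shift point
    return chunks
```

Because each window restarts at the detected shift point, chunk boundaries follow content transitions rather than a fixed size, which is the property the abstract argues benefits retrieval.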
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Document Retrieval | DUDE | -- | 32 |
| Question Answering | OmniEval | BLEU 19.97 | 9 |
| Question Answering | HChemSafety | BLEU 22.98 | 9 |
| Question Answering | CRUD | BLEU 0.5061 | 9 |
| Question Answering | WebCPM (test) | ROUGE-L 27.3 | 7 |
| Question Answering | DuReader (test) | F1 21.78 | 7 |
| Question Answering | CRUD Single-hop (test) | BLEU-1 0.3456 | 7 |
| Question Answering | CRUD Two-hop (test) | BLEU-1 22.04 | 7 |
| Retrieval | CUAD | Recall 90.31 | 6 |
| Question Answering | MOAMOB | ANLS 25.36 | 6 |