ChuLo: Chunk-Level Key Information Representation for Long Document Understanding

About

Transformer-based models have achieved remarkable success in various Natural Language Processing (NLP) tasks, yet their ability to handle long documents is constrained by computational limitations. Traditional approaches, such as truncating inputs, sparse self-attention, and chunking, attempt to mitigate these issues, but they often lead to information loss and hinder the model's ability to capture long-range dependencies. In this paper, we introduce ChuLo, a novel chunk representation method for long document understanding that addresses these limitations. ChuLo groups input tokens using unsupervised keyphrase extraction, emphasizing semantically important keyphrase-based chunks to retain core document content while reducing input length. This approach minimizes information loss and improves the efficiency of Transformer-based models. Preserving all tokens in long document understanding, especially in token classification tasks, is important to ensure that fine-grained annotations, which depend on the entire sequence context, are not lost. We evaluate our method on multiple long document classification tasks and long document token classification tasks, demonstrating its effectiveness through comprehensive qualitative and quantitative analysis. Our implementation is open-sourced at https://github.com/adlnlp/Chulo.
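
As a rough illustration of the idea in the abstract (group tokens into chunks, keep every token, and weight keyphrase tokens more heavily), here is a minimal sketch. It is not the authors' implementation: the frequency-based `toy_keywords` extractor, the function `chunk_representations`, and the `chunk_size`, `key_weight`, and `other_weight` parameters are all hypothetical stand-ins chosen for this example. See the linked repository for the actual method.

```python
from collections import Counter

# Assumption: a tiny stopword list, just enough for the toy extractor below.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}


def toy_keywords(tokens, top_k=3):
    """Hypothetical stand-in for unsupervised keyphrase extraction:
    returns the top_k most frequent non-stopword tokens (lowercased)."""
    counts = Counter(t.lower() for t in tokens if t.lower() not in STOPWORDS)
    return {word for word, _ in counts.most_common(top_k)}


def chunk_representations(tokens, embeddings, chunk_size=4, top_k=3,
                          key_weight=0.8, other_weight=0.1):
    """Group tokens into fixed-size chunks and return one vector per chunk.

    Each chunk vector is a weighted average of its token embeddings.
    Keyword tokens get key_weight and all other tokens get other_weight,
    so no token is dropped but key content dominates the representation.
    """
    keywords = toy_keywords(tokens, top_k=top_k)
    chunk_vectors = []
    for start in range(0, len(tokens), chunk_size):
        chunk_tokens = tokens[start:start + chunk_size]
        chunk_embs = embeddings[start:start + chunk_size]
        weights = [key_weight if t.lower() in keywords else other_weight
                   for t in chunk_tokens]
        total = sum(weights)
        dim = len(chunk_embs[0])
        # Weighted average, computed one embedding dimension at a time.
        vector = [sum(w * e[d] for w, e in zip(weights, chunk_embs)) / total
                  for d in range(dim)]
        chunk_vectors.append(vector)
    return chunk_vectors
```

With a chunk size of 4, an 8-token input becomes 2 chunk vectors, so a downstream Transformer would see a sequence a quarter of the original length while each vector still reflects every token in its chunk.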

Yan Li, Soyeon Caren Han, Yue Dai, Feiqi Cao • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Named Entity Recognition | CoNLL 2003 | -- | 86 |
| Named Entity Recognition | GUM | Micro F1 95.74 | 36 |
| Document Classification | HP (test) | Accuracy 95.38 | 10 |
| Document Classification | EURLEX57K (test) | Micro F1 73.32 | 8 |
| Document Classification | EURLEX57K Inverted (test) | Micro F1 72.44 | 7 |
| Document Classification | LUN (test) | Accuracy 64.4 | 7 |
| Named Entity Recognition | GUM (test) | Micro F1 95.55 | 5 |
| Named Entity Recognition | GUM All (26) -> 512 | Micro F1 95.55 | 5 |
| Named Entity Recognition | CoNLL ALL (test) | Micro F1 93.34 | 5 |
| Named Entity Recognition | CoNLL (>2048 tokens) (test) | Micro F1 93.25 | 5 |
Showing 10 of 17 rows
