
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

About

Recently, many studies have demonstrated that exclusively incorporating OCR-derived text and spatial layouts with large language models (LLMs) can be highly effective for document understanding tasks. However, existing methods that integrate spatial layouts with text have limitations, such as producing overly long text sequences or failing to fully leverage the autoregressive traits of LLMs. In this work, we introduce Interleaving Layout and Text in a Large Language Model (LayTextLLM) for document understanding. LayTextLLM projects each bounding box to a single embedding and interleaves it with text, efficiently avoiding long-sequence issues while leveraging the autoregressive traits of LLMs. LayTextLLM not only streamlines the interaction of layout and textual data but also shows enhanced performance in key information extraction (KIE) and visual question answering (VQA). Comprehensive benchmark evaluations reveal significant improvements of LayTextLLM, with a 15.2% increase on KIE tasks and 10.7% on VQA tasks compared to previous SOTA OCR-based LLMs. All resources are available at https://github.com/LayTextLLM/LayTextLLM.
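The core idea of the abstract — mapping each OCR bounding box to a single embedding and interleaving it with the text tokens of that box — can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the projection `W`, the toy tokenizer, and the hidden size are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 64  # illustrative hidden size, not the paper's

# Hypothetical learned linear projection: 4 bbox coords -> one embedding
W = rng.standard_normal((4, HIDDEN)) * 0.02
b = np.zeros(HIDDEN)

def embed_bbox(bbox, page_w, page_h):
    """Normalize (x1, y1, x2, y2) to [0, 1] and project to one HIDDEN-dim vector."""
    x1, y1, x2, y2 = bbox
    norm = np.array([x1 / page_w, y1 / page_h, x2 / page_w, y2 / page_h])
    return norm @ W + b  # the single "layout token" for this box

# Toy stand-in for the LLM's tokenizer + embedding table
vocab = {}
def token_embed(text):
    out = []
    for tok in text.split():
        if tok not in vocab:
            vocab[tok] = rng.standard_normal(HIDDEN) * 0.02
        out.append(vocab[tok])
    return out

def interleave(ocr_items, page_w, page_h):
    """Build [bbox_emb, tok_emb, ..., bbox_emb, tok_emb, ...] for the LLM input."""
    seq = []
    for text, bbox in ocr_items:
        seq.append(embed_bbox(bbox, page_w, page_h))  # one token per box
        seq.extend(token_embed(text))                 # then that box's text tokens
    return np.stack(seq)

ocr = [("Invoice Total", (50, 40, 210, 60)),
       ("$1,234.00", (400, 40, 520, 60))]
seq = interleave(ocr, page_w=612, page_h=792)
print(seq.shape)  # 2 layout tokens + 3 text tokens -> (5, 64)
```

Because each box costs exactly one extra token, the layout overhead grows with the number of OCR segments rather than with the number of coordinate digits, which is what keeps the interleaved sequence short.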

Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, Hao Liu, Can Huang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Document Visual Question Answering | DocVQA | ANLS | 77.2 | 164 |
| Deepfake Detection | DFDC | AUC | 69.95 | 135 |
| Deepfake Detection | DFD | AUC | 0.812 | 77 |
| Deepfake Detection | FF++ Intra-dataset c23 | AUC | 98.91 | 24 |
| Image Deepfake Detection | DFo | AUC | 0.8638 | 20 |
| Deepfake Detection | DFDCP | -- | -- | 20 |
| Document Visual Question Answering | SROIE | ANLS | 96.1 | 12 |
| Deepfake Detection | CDF | AUC | 75.52 | 12 |
| Document Visual Question Answering | CORD | ANLS | 82.5 | 12 |
| Document Visual Question Answering | FUNSD | ANLS | 81 | 12 |

Showing 10 of 15 rows.
