
DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding

About

Text-rich document understanding (TDU) requires comprehensive analysis of documents containing substantial textual content and complex layouts. While Multimodal Large Language Models (MLLMs) have made rapid progress in this domain, existing approaches either demand significant computational resources or struggle with effective multi-modal integration. In this paper, we introduce DocLayLLM, an efficient multi-modal extension of LLMs specifically designed for TDU. By lightly integrating visual patch tokens and 2D positional tokens into the LLM's input and encoding the document content with the LLM itself, we fully exploit the document comprehension capability of LLMs and enhance their perception of OCR information. We also examine the role of chain-of-thought (CoT) reasoning and introduce two techniques, CoT Pre-training and CoT Annealing. DocLayLLM achieves remarkable performance under lightweight training settings, demonstrating both its efficiency and effectiveness. Experimental results show that DocLayLLM outperforms existing OCR-dependent methods as well as OCR-free competitors. Code and model are available at https://github.com/whlscut/DocLayLLM.
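The abstract describes interleaving OCR text with visual patch tokens and discretized 2D positional tokens in the LLM's input sequence. The sketch below illustrates one plausible way such a sequence could be assembled; all function names, token formats, and the quantization grid are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a DocLayLLM-style input assembly.
# Visual patch tokens provide lightweight visual context, and each OCR
# word is followed by its quantized 2D bounding-box position tokens.
# Names and formats here are assumptions for illustration only.

def quantize_box(box, bins=1000):
    """Map a normalized bounding box (x0, y0, x1, y1) in [0, 1]
    onto an integer bins x bins grid of positional tokens."""
    return tuple(min(round(v * bins), bins - 1) for v in box)

def build_input_sequence(patch_tokens, ocr_words):
    """Concatenate visual patch tokens with OCR words, each word
    immediately followed by its quantized 2D position."""
    seq = list(patch_tokens)  # visual context first
    for word, box in ocr_words:
        seq.append(word)
        seq.append(quantize_box(box))
    return seq

# Toy example: two visual patches and two OCR words with boxes.
seq = build_input_sequence(
    ["<patch_0>", "<patch_1>"],
    [("Invoice", (0.1, 0.05, 0.3, 0.1)),
     ("Total", (0.1, 0.8, 0.25, 0.85))],
)
```

Because the positional information is injected as ordinary tokens rather than through a separate vision encoder, the LLM itself does the document encoding, which is consistent with the paper's claim of lightweight multi-modal integration.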

Wenhui Liao, Jiapeng Wang, Hongliang Li, Chengyu Wang, Jun Huang, Lianwen Jin · 2024

Related benchmarks

Task                                 Dataset     Metric  Result  Rank
Document Visual Question Answering   DocVQA      ANLS    78.4    164
Document Visual Question Answering   VisualMRC   ANLS    55      12
Document Visual Question Answering   FUNSD       ANLS    84.1    12
Document Visual Question Answering   CORD        ANLS    71.3    12
Document Visual Question Answering   SROIE       ANLS    84.3    12
